RAPIDS: Reproducible Analysis Pipeline for Data Streams Collected with Mobile Devices (Preprint)

Mapping Intimacies ◽

10.2196/preprints.23246 ◽

2020 ◽

Author(s):

Julio Vega ◽

Meng Li ◽

Kwesi Aguillera ◽

Nikunj Goel ◽

Echhit Joshi ◽

...

Keyword(s):

Data Streams ◽

Program Analysis ◽

Data Science ◽

Ground Truth ◽

Mobile Sensing ◽

Mobile Sensors ◽

Sensor Data ◽

Development Environment ◽

Ground Truth Data ◽

Reproducible Analysis

BACKGROUND: Smartphone and wearable devices are widely used in behavioral and clinical research to collect longitudinal data that, along with ground truth data, are used to create models of human behavior. Mobile sensing researchers often program analysis code from scratch even though many research teams collect data from similar mobile sensors, platforms and devices. As a result, the quality of code varies, code is often not shared alongside publications, and when it is, it might not be stored on a version control system and most of the time there is no guarantee the development environment can be replicated. This makes it difficult for other scientists to read, reuse, audit, and reproduce a publication’s code and its results.

OBJECTIVE: We present RAPIDS, a reproducible pipeline to standardize the preprocessing, feature extraction, analysis, visualization, and reporting of data streams coming from mobile sensors.

METHODS: RAPIDS is formed by a group of R and Python scripts that are executed on top of reproducible virtual environments, orchestrated by Snakemake and organized following the cookiecutter data science project. Its development has been and will be informed by public discussions with the mobile sensing research community.

RESULTS: We share open source, documented, extensible and tested code to preprocess and extract behavioral features from data collected with the AWARE Framework in Android and iOS smartphones as well as Fitbit devices. We also provide a file structure and development environment that other researchers can follow to publish their own models, visualizations, and reports.

CONCLUSIONS: RAPIDS allows researchers to process mobile sensor data in a rigorous and reproducible way. This saves time and effort during the data analysis phase of a project and makes it easier to share an analysis workflow alongside publications.


Reproducible Analysis Pipeline for Data Streams: Open-Source Software to Process Data Collected With Mobile Devices

Frontiers in Digital Health ◽

10.3389/fdgth.2021.769823 ◽

2021 ◽

Vol 3 ◽

Author(s):

Julio Vega ◽

Meng Li ◽

Kwesi Aguillera ◽

Nikunj Goel ◽

Echhit Joshi ◽

...

Keyword(s):

Open Source ◽

Data Streams ◽

Data Science ◽

Wearable Devices ◽

Mobile Sensors ◽

Sensor Data ◽

Process Data ◽

Analysis Pipeline ◽

Ground Truth Data ◽

Reproducible Analysis

Smartphone and wearable devices are widely used in behavioral and clinical research to collect longitudinal data that, along with ground truth data, are used to create models of human behavior. Mobile sensing researchers often program data processing and analysis code from scratch even though many research teams collect data from similar mobile sensors, platforms, and devices. This leads to significant inefficiency in not being able to replicate and build on others' work, inconsistency in quality of code and results, and lack of transparency when code is not shared alongside publications. We provide an overview of Reproducible Analysis Pipeline for Data Streams (RAPIDS), a reproducible pipeline to standardize the preprocessing, feature extraction, analysis, visualization, and reporting of data streams coming from mobile sensors. RAPIDS is formed by a group of R and Python scripts that are executed on top of reproducible virtual environments, orchestrated by a workflow management system, and organized following a consistent file structure for data science projects. We share open source, documented, extensible and tested code to preprocess, extract, and visualize behavioral features from data collected with any Android or iOS smartphone sensing app as well as Fitbit and Empatica wearable devices. RAPIDS allows researchers to process mobile sensor data in a rigorous and reproducible way. This saves time and effort during the data analysis phase of a project and facilitates sharing analysis workflows alongside publications.


Evaluation of Commercial Truck Parking Detection for Rest Areas

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198118788185 ◽

2018 ◽

Vol 2672 (9) ◽

pp. 141-151

Author(s):

Wei Sun ◽

Ethan Stoop ◽

Scott S. Washburn

Keyword(s):

Software Tool ◽

Vehicle Detection ◽

Ground Truth ◽

Early Morning ◽

Video Data ◽

Sensor Data ◽

Parking Space ◽

Ground Truth Data ◽

Detection Technology ◽

Rest Areas

Florida’s interstate rest areas are heavily utilized by commercial trucks for overnight parking. Many of these rest areas regularly experience 100% utilization of available commercial truck parking spaces during the evening and early-morning hours. Being able to communicate availability of commercial truck parking space to drivers in advance of arriving at a rest area would reduce unnecessary stops at full rest areas as well as driver anxiety. In order to do this, it is critical to implement a vehicle detection technology to reflect the parking status of the rest area correctly. The objective of this project was to evaluate three different wireless in-pavement vehicle detection technologies as applied to commercial truck parking at interstate rest areas. This paper mainly focuses on the following aspects: (a) accuracy of the vehicle detection in parking spaces, (b) installation, setup, and maintenance of the vehicle detection technology, and (c) truck parking trends at the rest area study site. The final project report includes a more detailed summary of the evaluation. The research team recorded video of the rest areas as the ground-truth data and developed a software tool to compare the video data with the parking sensor data. Two accuracy tests (event accuracy and occupancy accuracy) were conducted to evaluate each sensor’s ability to reflect the status of each parking space correctly. Overall, it was found that all three technologies performed well, with accuracy rates of 95% or better for both tests. This result suggests that, for implementation, pricing, and/or maintenance issues may be more significant factors for the choice of technology.
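
The two accuracy tests named in this abstract can be illustrated with a minimal sketch. The abstract does not define either measure, so the definitions below are assumptions: occupancy accuracy is taken as the share of sampled instants at which the sensor status equals the video status, and event accuracy as the share of video-observed arrivals and departures that the sensor reports within a small tolerance. All function names and the `tolerance` parameter are hypothetical and are not taken from the authors' software tool.

```python
from typing import List, Tuple


def occupancy_accuracy(truth: List[bool], sensor: List[bool]) -> float:
    """Share of sampled instants where sensor status equals video status (assumed definition)."""
    if len(truth) != len(sensor) or not truth:
        raise ValueError("series must be non-empty and equally long")
    matches = sum(t == s for t, s in zip(truth, sensor))
    return matches / len(truth)


def transitions(series: List[bool]) -> List[Tuple[int, bool]]:
    """Return (index, new_state) for every state change, i.e. each arrival or departure."""
    return [(i, series[i]) for i in range(1, len(series)) if series[i] != series[i - 1]]


def event_accuracy(truth: List[bool], sensor: List[bool], tolerance: int = 1) -> float:
    """Share of ground-truth events matched by a same-type sensor event within `tolerance` samples."""
    truth_events = transitions(truth)
    if not truth_events:
        return 1.0  # nothing to detect, so nothing was missed
    unmatched = transitions(sensor)
    hits = 0
    for index, state in truth_events:
        for candidate in unmatched:
            if candidate[1] == state and abs(candidate[0] - index) <= tolerance:
                unmatched.remove(candidate)  # each sensor event may match only once
                hits += 1
                break
    return hits / len(truth_events)


if __name__ == "__main__":
    # One parking space sampled at eight instants: True means occupied.
    video = [False, True, True, True, False, False, True, True]
    detector = [False, False, True, True, False, False, False, False]
    print(occupancy_accuracy(video, detector))
    print(event_accuracy(video, detector))
```

Keeping the two measures separate shows why both tests are useful: a sensor can report the right status most of the time and still miss individual arrivals or departures.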


On the potential and challenges of using machine-learning for automated quality control of environmental sensor data

10.5194/egusphere-egu2020-20777 ◽

2020 ◽

Author(s):

Lennart Schmidt ◽

Hannes Mollenhauer ◽

Corinna Rebmann ◽

David Schäfer ◽

Antje Claussnitzer ◽

...

Keyword(s):

Machine Learning ◽

Quality Control ◽

Ground Truth ◽

Sensor Data ◽

Small Scale ◽

Ground Truth Data ◽

Starting Point ◽

Environmental Sensor ◽

Spatio Temporal ◽

Automated Quality Control

With more and more data being gathered from environmental sensor networks, the importance of automated quality-control (QC) routines to provide usable data in near-real time is becoming increasingly apparent. Machine-learning (ML) algorithms exhibit a high potential in this respect as they are able to exploit the spatio-temporal relation of multiple sensors to identify anomalies while allowing for non-linear functional relations in the data. In this study, we evaluate the potential of ML for automated QC on two spatio-temporal datasets at different spatial scales: one is a dataset of atmospheric variables at 53 stations across Northern Germany; the second contains time series of soil moisture and temperature at 40 sensors at a small-scale measurement plot.

Furthermore, we investigate strategies to tackle three challenges that are commonly present when applying ML for QC: 1) As sensors might drop out, the ML models have to be designed to be robust against missing values in the input data. We address this by comparing different data imputation methods, coupled with a binary representation of whether a value is missing or not. 2) Quality flags that mark erroneous data points to serve as ground truth for model training might not be available. 3) There is no guarantee that the system under study is stationary, which might render the outputs of a trained model useless in the future. To address 2) and 3), we frame the problem both as a supervised and as an unsupervised learning problem. Here, the use of unsupervised ML models can be beneficial as they do not require ground truth data and can thus be retrained more easily should the system be subject to significant changes. In this presentation, we discuss the performance, advantages and drawbacks of the proposed strategies to tackle the aforementioned challenges. Thus, we provide a starting point for researchers in the largely untouched field of ML application for automated quality control of environmental sensor data.
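
The first strategy in this abstract, imputation coupled with a binary flag for missingness, can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract compares several imputation methods without naming them, so column-mean imputation is swapped in here as the simplest stand-in, and the function name `impute_with_indicator` is hypothetical.

```python
from typing import List, Optional


def impute_with_indicator(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Fill missing readings with the column mean and append one 0/1 missing flag per sensor.

    Each input row holds one reading per sensor, with None marking a dropout.
    Each output row holds the filled readings followed by the flags.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: a sensor with no readings at all is filled with 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)

    result = []
    for row in rows:
        values = [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        flags = [1.0 if row[j] is None else 0.0 for j in range(n_cols)]
        result.append(values + flags)
    return result


if __name__ == "__main__":
    # Two sensors, three time steps; each sensor drops out once.
    readings = [[1.0, None], [3.0, 4.0], [None, 8.0]]
    for features in impute_with_indicator(readings):
        print(features)
```

The flags let a downstream model distinguish a measured value from a filled one, which is the point of pairing imputation with a binary representation of missingness.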


The Zurich urban micro aerial vehicle dataset

The International Journal of Robotics Research ◽

10.1177/0278364917702237 ◽

2017 ◽

Vol 36 (3) ◽

pp. 269-273 ◽

Cited By ~ 30

Author(s):

András L Majdik ◽

Charles Till ◽

Davide Scaramuzza

Keyword(s):

Three Dimensional ◽

Ground Truth ◽

Ground Level ◽

Urban Environments ◽

Sensor Data ◽

Measurement Unit ◽

Reconstruction Algorithms ◽

Ground Truth Data ◽

Micro Aerial Vehicle ◽

Aerial Vehicle

This paper presents a dataset recorded on-board a camera-equipped micro aerial vehicle flying within the urban streets of Zurich, Switzerland, at low altitudes (i.e. 5–15 m above the ground). The 2 km dataset consists of time synchronized aerial high-resolution images, global position system and inertial measurement unit sensor data, ground-level street view images, and ground truth data. The dataset is ideal to evaluate and benchmark appearance-based localization, monocular visual odometry, simultaneous localization and mapping, and online three-dimensional reconstruction algorithms for micro aerial vehicles in urban environments.


Leveraging Mobile Sensing and Machine Learning for Personalized Mental Health Care

Ergonomics in Design The Quarterly of Human Factors Applications ◽

10.1177/1064804620920494 ◽

2020 ◽

Vol 28 (4) ◽

pp. 18-23

Author(s):

Mehdi Boukhechba ◽

Anna N. Baglione ◽

Laura E. Barnes

Keyword(s):

Mental Health ◽

Health Care ◽

Mental Illness ◽

Health Care Systems ◽

Ground Truth ◽

Mobile Sensing ◽

Sensor Data ◽

Health State ◽

Care Systems ◽

Mental Health Interventions

Mental illness is widespread in our society, yet remains difficult to treat due to challenges such as stigma and overburdened health care systems. New paradigms are needed for treating mental illness outside the practitioner’s office. We propose a framework to guide the design of mobile sensing systems for personalized mental health interventions. This framework guides researchers in constructing interventions from the ground up through four phases: sensor data collection, digital biomarker extraction, health state detection, and intervention deployment. We highlight how this framework advances research in personalized mHealth and address remaining challenges, such as ground truth fidelity and missing data.


Predicting Venue Popularity Using Crowd-Sourced and Passive Sensor Data

Smart Cities ◽

10.3390/smartcities3030042 ◽

2020 ◽

Vol 3 (3) ◽

pp. 818-841

Author(s):

Stanislav Timokhin ◽

Mohammad Sadrani ◽

Constantinos Antoniou

Keyword(s):

Auxiliary Information ◽

Ground Truth ◽

Spatiotemporal Data ◽

Sensor Data ◽

Mobility Patterns ◽

Ground Truth Data ◽

Dependent Variables ◽

Passive Sensor ◽

Location Based Social Networks

Efficient and reliable mobility pattern identification is essential for transport planning research. In order to infer mobility patterns, however, a large amount of spatiotemporal data is needed, which is not always available. Hence, location-based social networks (LBSNs) have received considerable attention as a potential data provider. The aim of this study is to investigate the possibility of using several different auxiliary information sources for venue popularity modeling and provide an alternative venue popularity measuring approach. Initially, data from widely used services, such as Google Maps, Yelp and OpenStreetMap (OSM), are used to model venue popularity. To estimate hourly venue occupancy, two different classes of model are used, including linear regression with lasso regularization and gradient boosted regression (GBR). The predictions are made based on venue-related parameters (e.g., rating, comments) and locational properties (e.g., stores, hotels, attractions). Results show that the prediction can be improved using GBR with a logarithmic transformation of the dependent variables. To investigate the quality of social media-based models by obtaining WiFi-based ground truth data, a microcontroller setup is developed to measure the actual number of people attending venues using WiFi presence detection, demonstrating that the similarity between the results of WiFi data collection and Google “Popular Times” is relatively promising.
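
The logarithmic transformation of the dependent variable mentioned in this abstract can be illustrated with a small sketch. To keep it self-contained, a one-variable least-squares line is swapped in for the gradient boosted regression used in the study; the transformation step is the same in either case. The choice of `log1p`/`expm1` (which tolerate zero occupancy) and all function names are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log_target(x: List[float], occupancy: List[float]) -> Tuple[float, float]:
    """Fit the model on log(1 + occupancy) instead of on raw occupancy."""
    log_y = [math.log1p(value) for value in occupancy]
    return fit_line(x, log_y)


def predict_occupancy(model: Tuple[float, float], x_new: float) -> float:
    """Predict on the log scale, then invert the transformation."""
    intercept, slope = model
    return math.expm1(intercept + slope * x_new)


if __name__ == "__main__":
    # Hypothetical venue feature (e.g. a rating) against hourly occupancy.
    feature = [0.0, 1.0, 2.0, 3.0]
    occupancy = [math.expm1(0.5 + 0.3 * value) for value in feature]
    model = fit_log_target(feature, occupancy)
    print(predict_occupancy(model, 4.0))
```

Fitting on the log scale reduces the influence of a few very busy venue-hours on the fit; predictions are mapped back to the original scale with the inverse transformation.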


The Study of Fresh-Water Lake Ice using Multiplexed Imaging Radar

Journal of Glaciology ◽

10.1017/s0022143000034596 ◽

1975 ◽

Vol 15 (73) ◽

pp. 461 ◽

Cited By ~ 3

Author(s):

M. Leonard Bryan ◽

R. W. Larson

Keyword(s):

Great Lakes ◽

Ground Truth ◽

Field Work ◽

Sensor Data ◽

Ice Thickness ◽

Remotely Sensed Data ◽

Radar Backscatter ◽

Upper Great Lakes ◽

Ground Truth Data ◽

Adverse Weather

The study of ice in the upper Great Lakes, both from the operational and the scientific points of view, is receiving continued attention. Both quantitative and qualitative field work is being conducted to provide the needed background for accurate interpretation of remotely sensed data. The sensor data under discussion in this paper were obtained by a side-looking multiplexed airborne radar (SLAR). These were supplemented with ground-truth data. Radar, due to its ability to penetrate adverse weather, is an especially important instrument for monitoring ice in the upper Great Lakes. It has been previously shown that imaging radars can provide maps of ice cover in these areas. However, questions concerning both the nature of the surfaces reflecting radar energy and the interpretation of the radar imagery continually arise. Our analysis of ice in Whitefish Bay (Lake Superior) indicated that the combination of the ice/water interface with the ice/air interface is the major contributor to the radar backscatter as seen on the imagery. The ice has a very low dielectric constant (<3.0) and a low loss tangent. Thus, this ice is somewhat transparent to the energy used by the imaging SLAR system. The ice types studied include newly formed black ice, pancake ice, and frozen and consolidated pack and brash ice. Although ice thickness cannot be measured directly from the received signals, it is suspected that by combining the radar backscatter information with both meteorological and sea-state history of the area and with some basic ground truth, better estimates of the ice thickness may be provided. In addition, certain ice features (e.g. ridges, ice foots, areas of brash ice) may be identified with reasonable confidence. There is a continued need for additional ground work to verify the validity of imaging radars for these types of interpretations. This paper is being published in full in another issue of the Journal of Glaciology.


Land Surface Temperature Retrieval from LANDSAT-8 Thermal Infrared Sensor Data and Validation with Infrared Thermometer Camera

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.20.27418 ◽

2018 ◽

Vol 7 (4.20) ◽

pp. 601

Author(s):

Muhammad Mejbel Salih ◽

Oday Zakariya Jasim ◽

Khalid I. Hassoon ◽

Aysar Jameel Abdalkadhum

Keyword(s):

Land Cover ◽

Surface Temperature ◽

Land Surface Temperature ◽

Land Surface ◽

Ground Truth ◽

Sensor Data ◽

Landsat 8 ◽

Infrared Sensor ◽

Ground Truth Data ◽

Land Cover Types

This paper presents a proposed method for the retrieval of land surface temperature (LST) from the two thermal bands of LANDSAT-8, using LANDSAT-8 Operational Land Imager and Thermal Infrared Sensor (OLI & TIRS) satellite data. LANDSAT-8, the latest satellite from the Landsat series, was launched on 11 February 2013. LANDSAT-8 medium spatial resolution multispectral imagery is of particular interest for extracting land cover because of its fine spectral resolution and its radiometric quantization of 12 bits. In this research, an attempt was made to estimate LST over the Al-Hashimiya district, south of Babylon province, in the middle of Iraq. Images acquired on two dates, 2 and 18 March 2018, were used to retrieve LST and compare it with ground truth data from an infrared thermometer camera (all measurements were made in contact with the target using a type-K thermocouple) taken at the time of image capture. The results showed that the rivers had a higher LST than the other land cover types, with a difference of less than 3.47 °C, and that the LST differences for vegetation and residential areas were less than 0.4 °C. The correlation coefficients for bands 10 and 11 were Rband10 = 0.70 and Rband11 = 0.89, respectively, for the image acquired on 2 March 2018, and Rband10 = 0.70 and Rband11 = 0.72 for the image acquired on 18 March 2018. These results confirm that the proposed approach is effective for the retrieval of LST from the LANDSAT-8 thermal bands, and that IR thermometer camera data are an effective way to validate and improve the performance of LST retrieval. In general, the results show that the closer a measurement is taken to the scene center time, the better the quality of the land cover classification. The purpose of this study is to assess the use of LANDSAT-8 data to specify temperature differences in land cover and compare the relationship between land surface temperature and land cover types.


The Study of Fresh-Water Lake Ice using Multiplexed Imaging Radar

Journal of Glaciology ◽

10.3189/s0022143000034596 ◽

1975 ◽

Vol 15 (73) ◽

pp. 461-461

Author(s):

M. Leonard Bryan ◽

R. W. Larson

Keyword(s):

Great Lakes ◽

Ground Truth ◽

Field Work ◽

Sensor Data ◽

Ice Thickness ◽

Remotely Sensed Data ◽

Radar Backscatter ◽

Upper Great Lakes ◽

Ground Truth Data ◽

Adverse Weather

The study of ice in the upper Great Lakes, both from the operational and the scientific points of view, is receiving continued attention. Both quantitative and qualitative field work is being conducted to provide the needed background for accurate interpretation of remotely sensed data. The sensor data under discussion in this paper were obtained by a side-looking multiplexed airborne radar (SLAR). These were supplemented with ground-truth data. Radar, due to its ability to penetrate adverse weather, is an especially important instrument for monitoring ice in the upper Great Lakes. It has been previously shown that imaging radars can provide maps of ice cover in these areas. However, questions concerning both the nature of the surfaces reflecting radar energy and the interpretation of the radar imagery continually arise. Our analysis of ice in Whitefish Bay (Lake Superior) indicated that the combination of the ice/water interface with the ice/air interface is the major contributor to the radar backscatter as seen on the imagery. The ice has a very low dielectric constant (<3.0) and a low loss tangent. Thus, this ice is somewhat transparent to the energy used by the imaging SLAR system. The ice types studied include newly formed black ice, pancake ice, and frozen and consolidated pack and brash ice. Although ice thickness cannot be measured directly from the received signals, it is suspected that by combining the radar backscatter information with both meteorological and sea-state history of the area and with some basic ground truth, better estimates of the ice thickness may be provided. In addition, certain ice features (e.g. ridges, ice foots, areas of brash ice) may be identified with reasonable confidence. There is a continued need for additional ground work to verify the validity of imaging radars for these types of interpretations. This paper is being published in full in another issue of the Journal of Glaciology.


Hindcasting violent events in Colombia using Internet data

Digital Government: Research and Practice (DGOV) ◽

10.1145/3462211 ◽

2021 ◽

Author(s):

Ashlynn Daughton ◽

Sara Y. Del Valle ◽

Chrysm Watson Ross ◽

Geoffrey Fairchild

Keyword(s):

Civil Wars ◽

Data Streams ◽

Ground Truth ◽

Left Wing ◽

Ground Truth Data ◽

News Sources ◽

Monitoring And Forecasting ◽

Armed Violence ◽

The Government ◽

Google Search

Colombia experienced a decades-long civil war between the government and many left-wing guerrilla groups. It was marked by violence, kidnappings, and large-scale human displacement. Monitoring and forecasting civil wars are important to mitigate their potential impact but require access to ground truth data. We examine the use of Internet data streams, namely Google search queries, tweets related to politics, and traditional news sources, to retrospectively forecast (i.e., hindcast) state-based armed violence in Colombia. We compare the results of statistical models using three combinations of these features to evaluate the predictive capabilities of each data stream. Our results show that models combining Internet and traditional news data perform most consistently, though Internet-only models are surprisingly promising. Overall, we are able to produce high-quality models hindcasting the presence or absence of state-based armed violence in Colombia up to 6 months in advance. These results support the use of exogenous data streams to forecast evolving situations around the globe.
