Predicting Venue Popularity Using Crowd-Sourced and Passive Sensor Data

Smart Cities ◽  
2020 ◽  
Vol 3 (3) ◽  
pp. 818-841
Author(s):  
Stanislav Timokhin ◽  
Mohammad Sadrani ◽  
Constantinos Antoniou

Efficient and reliable mobility pattern identification is essential for transport planning research. Inferring mobility patterns, however, requires a large amount of spatiotemporal data, which is not always available. Hence, location-based social networks (LBSNs) have received considerable attention as a potential data source. The aim of this study is to investigate the possibility of using several different auxiliary information sources for venue popularity modeling and to provide an alternative approach to measuring venue popularity. Initially, data from widely used services, such as Google Maps, Yelp, and OpenStreetMap (OSM), are used to model venue popularity. To estimate hourly venue occupancy, two different classes of models are used: linear regression with lasso regularization and gradient boosted regression (GBR). The predictions are made based on venue-related parameters (e.g., rating, comments) and locational properties (e.g., stores, hotels, attractions). Results show that the prediction can be improved using GBR with a logarithmic transformation of the dependent variable. To assess the quality of these social media-based models against ground truth, a microcontroller setup is developed that measures the actual number of people attending venues using WiFi presence detection; the agreement between the WiFi counts and Google “Popular Times” is relatively promising.
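The modeling setup described above — lasso-regularized linear regression versus GBR fit on a log-transformed occupancy target — can be sketched as follows. All data, feature names, and hyperparameters here are illustrative stand-ins, not the authors' code or dataset:

```python
# Sketch: hourly venue occupancy regression with two model classes, as in the
# abstract. Synthetic features stand in for rating, comment count, and nearby
# amenities; the skewed target motivates the log1p/expm1 transform for GBR.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 1, size=(n, 4))  # hypothetical: rating, comments, stores, hotels
# Skewed, nonlinear occupancy target
y = np.expm1(2 * X[:, 0] + 1.5 * X[:, 1] * X[:, 2]) + rng.exponential(1.0, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: linear regression with lasso regularization
lasso = Lasso(alpha=0.01).fit(X_tr, y_tr)

# GBR on the log-transformed target; predictions mapped back with expm1
gbr = GradientBoostingRegressor(random_state=0).fit(X_tr, np.log1p(y_tr))
pred_gbr = np.expm1(gbr.predict(X_te))

print("lasso R^2 on held-out data:", lasso.score(X_te, y_te))
```

The log transform compresses the heavy right tail of occupancy counts, so the squared-error loss inside GBR is not dominated by a few very busy hours.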

Author(s):  
Wei Sun ◽  
Ethan Stoop ◽  
Scott S. Washburn

Florida’s interstate rest areas are heavily utilized by commercial trucks for overnight parking. Many of these rest areas regularly experience 100% utilization of available commercial truck parking spaces during the evening and early-morning hours. Communicating the availability of commercial truck parking spaces to drivers before they arrive at a rest area would reduce unnecessary stops at full rest areas as well as driver anxiety. To do this, it is critical to implement a vehicle detection technology that correctly reflects the parking status of the rest area. The objective of this project was to evaluate three different wireless in-pavement vehicle detection technologies as applied to commercial truck parking at interstate rest areas. This paper mainly focuses on the following aspects: (a) accuracy of vehicle detection in parking spaces, (b) installation, setup, and maintenance of the vehicle detection technology, and (c) truck parking trends at the rest area study site. The final project report includes a more detailed summary of the evaluation. The research team recorded video of the rest areas as the ground-truth data and developed a software tool to compare the video data with the parking sensor data. Two accuracy tests (event accuracy and occupancy accuracy) were conducted to evaluate each sensor’s ability to reflect the status of each parking space correctly. Overall, all three technologies performed well, with accuracy rates of 95% or better on both tests. This result suggests that, for implementation, pricing and/or maintenance issues may be more significant factors in the choice of technology.
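The two accuracy measures mentioned can be illustrated with a toy computation. The exact definitions used in the project are not given in the abstract, so the metrics below are plausible simplified versions, and the sampled timelines are invented:

```python
# Illustrative versions of the two tests: occupancy accuracy (does the sensor
# report the correct occupied/vacant state at each sampled instant?) and a
# crude event accuracy (what share of ground-truth state changes did the
# sensor also register?). Not the project's actual definitions.

def occupancy_accuracy(sensor_states, truth_states):
    """Fraction of sampled instants where the sensor matches video truth."""
    assert len(sensor_states) == len(truth_states)
    matches = sum(s == t for s, t in zip(sensor_states, truth_states))
    return matches / len(truth_states)

def event_accuracy(sensor_states, truth_states):
    """Share of ground-truth arrivals/departures also seen by the sensor."""
    truth_events = sum(a != b for a, b in zip(truth_states, truth_states[1:]))
    sensor_events = sum(a != b for a, b in zip(sensor_states, sensor_states[1:]))
    if truth_events == 0:
        return 1.0
    return min(sensor_events, truth_events) / truth_events

# 1 = occupied, 0 = vacant, sampled once per minute for a single space
truth  = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
sensor = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]  # detection lagged by one sample

print(occupancy_accuracy(sensor, truth))  # 0.9
```

A real evaluation would also need per-space matching of events in time, which this sketch glosses over.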


Author(s):  
Zhihan Fang ◽  
Yu Yang ◽  
Guang Yang ◽  
Yikuan Xian ◽  
Fan Zhang ◽  
...  

Data from cellular networks have proved to be one of the most promising ways to understand large-scale human mobility for various ubiquitous computing applications, due to the high penetration of cellphones and low collection cost. Existing mobility models driven by cellular network data suffer from sparse spatiotemporal observations, because user locations are recorded only with cellphone activities, e.g., calls, texts, or internet access. In this paper, we design a human mobility recovery system called CellSense that takes sparse cellular billing records (CBR) as input and outputs dense, continuous records, closing the sensing gap that arises when cellular networks are used as sensing systems for human mobility. There is limited work on this kind of recovery system at large scale because, even though it is straightforward to design a recovery system based on regression models, it is very challenging to evaluate such models at scale due to the lack of ground truth data. In this paper, we explore a new opportunity created by the upgrade of cellular infrastructures: cellular network signaling data as ground truth, which log the interactions between cellphones and cellular towers at the signaling level (e.g., attaching, detaching, paging) even without billable activities. Based on the signaling data, we design CellSense to recover human mobility by integrating collective mobility patterns with individual mobility modeling, achieving a 35.3% improvement over state-of-the-art models. The key application of our recovery model is to take the regular sparse CBR data a researcher already has and recover the data missing due to sensing gaps, producing dense cellular data on which to train machine learning models for use cases such as next-location prediction.
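The input/output shape of the recovery task can be shown with a minimal baseline: given sparse (timestamp, location) billing records, interpolate onto a dense time grid. CellSense itself combines collective patterns with individual models; this sketch, with entirely synthetic values, only illustrates what "sparse in, dense out" means:

```python
# Baseline sketch of the recovery problem: sparse CBR-like observations in,
# dense trace out. Per-coordinate linear interpolation stands in for the
# paper's far richer model. All values are synthetic.
import numpy as np

def recover_dense_trace(times, xs, ys, grid):
    """Interpolate sparse location observations onto a dense time grid."""
    return np.interp(grid, times, xs), np.interp(grid, times, ys)

# One record per billable activity (call, text, internet access)
times = np.array([0.0, 3.0, 7.0, 10.0])   # hours
xs    = np.array([1.0, 2.0, 2.0, 5.0])    # tower coordinates, arbitrary units
ys    = np.array([0.0, 1.0, 4.0, 4.0])

grid = np.arange(0.0, 10.5, 0.5)          # dense half-hourly grid
dense_x, dense_y = recover_dense_trace(times, xs, ys, grid)
```

The dense output is what a downstream model (e.g., next-location prediction) would be trained on; evaluating how well the gaps were filled is exactly where the signaling-data ground truth comes in.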


2020 ◽  
Author(s):  
Lennart Schmidt ◽  
Hannes Mollenhauer ◽  
Corinna Rebmann ◽  
David Schäfer ◽  
Antje Claussnitzer ◽  
...  

With more and more data being gathered from environmental sensor networks, the importance of automated quality-control (QC) routines that provide usable data in near-real time is becoming increasingly apparent. Machine-learning (ML) algorithms exhibit high potential in this respect, as they can exploit the spatio-temporal relations among multiple sensors to identify anomalies while allowing for non-linear functional relations in the data. In this study, we evaluate the potential of ML for automated QC on two spatio-temporal datasets at different spatial scales: the first is a dataset of atmospheric variables at 53 stations across Northern Germany; the second contains time series of soil moisture and temperature from 40 sensors at a small-scale measurement plot.

Furthermore, we investigate strategies to tackle three challenges that commonly arise when applying ML for QC: 1) As sensors might drop out, the ML models have to be robust against missing values in the input data. We address this by comparing different data imputation methods, coupled with a binary representation of whether a value is missing or not. 2) Quality flags that mark erroneous data points, which would serve as ground truth for model training, might not be available. 3) There is no guarantee that the system under study is stationary, which might render the outputs of a trained model useless in the future. To address 2) and 3), we frame the problem both as a supervised and as an unsupervised learning problem. Here, unsupervised ML models can be beneficial, as they do not require ground truth data and can thus be retrained more easily should the system undergo significant changes. In this presentation, we discuss the performance, advantages, and drawbacks of the proposed strategies to tackle the aforementioned challenges. Thus, we provide a starting point for researchers in the largely untouched field of ML application for automated quality control of environmental sensor data.
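One way the strategies above can fit together: impute missing sensor values, append a binary is-missing indicator per feature, and run an unsupervised anomaly detector that needs no quality flags as ground truth. The data, the injected anomaly, and the choice of Isolation Forest are illustrative, not taken from the study:

```python
# Sketch: imputation + binary missingness indicator + unsupervised anomaly
# detection for sensor QC. Synthetic data; Isolation Forest is one possible
# unsupervised choice, not necessarily the authors'.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)
X = rng.normal(20.0, 2.0, size=(300, 3))   # e.g. temperature at 3 stations
X[::25, 1] = np.nan                        # simulated sensor dropouts
X[10] = [20.0, 45.0, 20.0]                 # one injected anomalous spike

missing_mask = np.isnan(X).astype(float)   # binary missingness representation
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
features = np.hstack([X_imputed, missing_mask])

# Lower decision_function scores = more anomalous
scores = IsolationForest(random_state=0).fit(features).decision_function(features)
```

Because the detector is unsupervised, it can simply be refit on recent data if the system drifts, which is the point made about challenge 3).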


Author(s):  
Christopher Toth ◽  
Wonho Suh ◽  
Vetri Elango ◽  
Ramik Sadana ◽  
Angshuman Guin ◽  
...  

Basic traffic counts are among the key elements in transportation planning and forecasting. As emerging data collection technologies proliferate, the availability of traffic count data will expand by orders of magnitude. However, availability of data does not guarantee data accuracy, and it is essential that observed data be compared with ground truth data. Little research or guidance is available on ensuring the quality of the ground truth data against which the count results of automated technologies are compared. To address the issue of ground truth data based on manual counts, a manual traffic counting application was developed for an Android tablet. Unlike other manual count applications, this application allows data collectors to replay and toggle through the video in a supervisory mode to review and correct counts made in the first pass. For system verification, the review function of the application was used to count and recount freeway traffic in videos from the Atlanta, Georgia, metropolitan area. Initial counts and reviewed counts were compared, and improvements in count accuracy were assessed. The results indicated the benefit of the review process and suggested that this application could minimize human error and provide more accurate ground truth traffic count data for use in transportation planning applications and for model verification.


Semantic Web ◽  
2020 ◽  
pp. 1-19
Author(s):  
Anca Dumitrache ◽  
Oana Inel ◽  
Benjamin Timmermans ◽  
Carlos Ortiz ◽  
Robert-Jan Sips ◽  
...  

The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity as a means of addressing the volume of data and the lack of annotators. Typically, these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of valid perspectives on the same information. We present an empirically derived methodology for efficiently gathering ground truth data across a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics, which capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring high-quality ground truth. We demonstrate this by comparing the quality of data aggregated with CrowdTruth metrics against majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction, and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
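The contrast between majority vote and disagreement-aware aggregation can be shown in miniature. The actual CrowdTruth metrics are richer (they weight workers and media units jointly); this toy version, with an invented vote set, only shows the core idea of keeping graded support instead of a single winning label:

```python
# Toy contrast: majority vote collapses annotations to one label, while a
# disagreement-preserving score keeps the support each label received.
# Simplified in the spirit of CrowdTruth, not its actual metrics.
from collections import Counter

def majority_vote(annotations):
    """Single winning label per item."""
    return Counter(annotations).most_common(1)[0][0]

def label_scores(annotations):
    """Graded support per label, preserving inter-annotator disagreement."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical: 10 workers annotate the relation expressed in a sentence
votes = ["treats"] * 6 + ["causes"] * 4

print(majority_vote(votes))   # "treats" — the 40% dissent is discarded
print(label_scores(votes))    # {"treats": 0.6, "causes": 0.4}
```

On ambiguous items the graded score flags genuine ambiguity, whereas majority vote silently converts it into a hard (and possibly wrong) label.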


2020 ◽  
Vol 24 ◽  
pp. 63-86
Author(s):  
Francisco Mena ◽  
Ricardo Ñanculef ◽  
Carlos Valle

The lack of annotated data is one of the major barriers facing machine learning applications today. Learning from crowds, i.e., collecting ground-truth data from multiple inexpensive annotators, has become a common way to cope with this issue. It has recently been shown that modeling the varying quality of the annotations obtained in this way is fundamental to achieving satisfactory performance in tasks where inexpert annotators may represent the majority but not the most trusted group. Unfortunately, existing techniques represent the annotation pattern of each annotator individually, making the models difficult to estimate in large-scale scenarios. In this paper, we present two models to address these problems. Both are based on the hypothesis that collective annotation patterns can be learned by introducing confusion matrices associated with groups of data points or annotators. The first approach clusters data points with a common annotation pattern, regardless of the annotators from which the labels were obtained. Implicitly, this method attributes annotation mistakes to the complexity of the data itself rather than to the variable behavior of the annotators. The second approach explicitly maps annotators to latent groups that are collectively parametrized to learn a common annotation pattern. Our experimental results show that, compared with other methods for learning from crowds, both methods have advantages in scenarios with a large number of annotators and a small number of annotations per annotator.
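The central object in both models is a confusion matrix shared by a group rather than estimated per annotator, which is what keeps the parameter count small. As a sketch, the estimator below assumes true labels are known, purely to make the object concrete; the paper's methods infer the labels and group assignments jointly via latent variables:

```python
# Sketch: a row-normalized confusion matrix pooled over all annotations of a
# *group* of annotators. True labels are assumed known here for illustration
# only; the paper infers them jointly. Data is invented.
import numpy as np

def group_confusion_matrix(true_labels, annotations, n_classes):
    """M[t, a] = P(group annotates class a | true class t), estimated by pooling."""
    M = np.zeros((n_classes, n_classes))
    for t, a in zip(true_labels, annotations):
        M[t, a] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)

# Pooled (true, annotated) label pairs from one hypothetical annotator group
true = [0, 0, 0, 1, 1, 1, 1, 1]
ann  = [0, 0, 1, 1, 1, 1, 0, 1]
M = group_confusion_matrix(true, ann, n_classes=2)
```

With G groups instead of N annotators, only G such matrices need estimating, so each one is supported by many more annotations — the advantage the abstract claims for the large-annotators, few-labels-per-annotator regime.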


2016 ◽  
Vol 5 (2) ◽  
pp. 121-129 ◽  
Author(s):  
Angel J. Lopez ◽  
Ivana Semanjski ◽  
Dominique Gillis ◽  
Daniel Ochoa ◽  
Sidharta Gautama

Traditional travel survey methods have been widely used for collecting information about urban mobility, although the Global Positioning System (GPS) has, since the mid-1990s, become an automated option for collecting more precise household data. Many studies on mobility patterns have focused on the advantages of GPS while leaving aside issues such as the quality of the collected data. However, when it comes to extracting trip frequency and travelled distance, the technology faces gaps due to issues such as signal reception and time-to-first-fix, which result in missing observations and, consequently, unrecognised or over-segmented trips. In this study, we focus on two aspects of GPS data for the car mode: (i) measurement of the gaps in the travelled distance, and (ii) estimation of the travelled distance and the factors that influence the GPS gaps. To assess this, GPS tracks are compared to a ground truth source. Additionally, the trips are analysed by land use (e.g., urban and rural areas) and length (e.g., short, medium, and long trips). Results from 170 participants and more than a year of GPS tracking show that around 9% of the travelled distance is not captured by GPS, and that this affects short trips more than long ones. Moreover, we confirm the importance of the time spent on the user activity and of land use as factors that influence the gaps in GPS data.
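The gap measurement in (i) amounts to walking consecutive GPS fixes, summing great-circle distances, and attributing to "gaps" the distance bridged by intervals longer than a threshold. The coordinates and the 120-second threshold below are illustrative choices, not the study's parameters:

```python
# Sketch of the travelled-distance gap share: haversine distance between
# consecutive fixes, with segments spanning long time gaps counted as
# "missed" distance. Track and threshold are illustrative.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def gap_share(track, gap_seconds=120):
    """Fraction of total travelled distance falling inside time gaps."""
    total = gap = 0.0
    for (t1, la1, lo1), (t2, la2, lo2) in zip(track, track[1:]):
        d = haversine_km(la1, lo1, la2, lo2)
        total += d
        if t2 - t1 > gap_seconds:
            gap += d
    return gap / total if total else 0.0

# (unix_time, lat, lon) fixes with one simulated 10-minute signal loss
track = [(0, 51.05, 3.72), (30, 51.06, 3.72), (630, 51.10, 3.75), (660, 51.11, 3.75)]
```

Note the straight-line distance across a gap understates the true travelled distance, which is one reason the measured 9% is best read as a lower bound on what GPS misses.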


2021 ◽  
Author(s):  
Michael Tarasiou

This paper presents DeepSatData, a pipeline for automatically generating satellite imagery datasets for training machine learning models. We also discuss design considerations, with emphasis on dense classification tasks, e.g., semantic segmentation. The implementation makes use of freely available Sentinel-2 data, which allows the generation of the large-scale datasets required for training deep neural networks (DNNs). We discuss issues faced from the point of view of DNN training and evaluation, such as checking the quality of ground truth data, and comment on the scalability of the approach.


2018 ◽  
Author(s):  
Naihui Zhou ◽  
Zachary D Siegel ◽  
Scott Zarecor ◽  
Nigel Lee ◽  
Darwin A Campbell ◽  
...  

Abstract
The accuracy of machine learning tasks critically depends on high-quality ground truth data. In many cases, producing good ground truth data involves trained professionals; however, this can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large volume of training data of good quality. We study an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We investigate the accuracy, speed, and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert worker. We provide best practices for assessing the quality of ground truth data and for comparing data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at low cost and high quality, especially in the context of high-throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.

Author Summary
Food security is a growing global concern. Farmers, plant breeders, and geneticists are hastening to address the challenges presented to agriculture by climate change, dwindling arable land, and population growth. Scientists in the field of plant phenomics are using satellite and drone images to understand how crops respond to a changing environment and to combine genetics and environmental measures to maximize crop growth efficiency. However, the terabytes of image data require new computational methods to extract useful information. Machine learning algorithms are effective in recognizing select parts of images, but they require high-quality data curated by people to train them, a process that can be laborious and costly. We examined how well crowdsourcing works in providing training data for plant phenomics, specifically, segmenting a corn tassel – the male flower of the corn plant – from the often-cluttered images of a cornfield. We provided images to students and to Amazon MTurkers, the latter being an on-demand workforce brokered by Amazon.com and paid on a task-by-task basis. We report on best practices in crowdsourcing image labeling for phenomics, and compare the different groups on measures such as fatigue and accuracy over time. We find that crowdsourcing is a good way of generating quality labeled data, rivaling that of experts.
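Comparing a worker's segmentation against an expert's is commonly done with intersection over union (IoU), one plausible choice for the kind of quality comparison described. The tiny binary masks below are fabricated for illustration:

```python
# Sketch: intersection over union (IoU) between a crowd worker's tassel mask
# and an expert's mask — one standard segmentation-quality metric. The masks
# are fabricated toy examples.
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0

expert = np.zeros((6, 6), dtype=bool)
expert[1:4, 1:4] = True      # 9-pixel tassel region marked by the expert

worker = np.zeros((6, 6), dtype=bool)
worker[2:5, 1:4] = True      # worker's mask, shifted one row down

print(iou(worker, expert))   # 6 / 12 = 0.5
```

Averaging such scores per worker over many images gives exactly the kind of per-group quality comparison (students vs. MTurk vs. Master MTurk vs. expert) the abstract reports.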


2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Koustuv Saha ◽  
Asra Yousuf ◽  
Ryan L. Boyd ◽  
James W. Pennebaker ◽  
Munmun De Choudhury

Abstract
The mental health of college students is a growing concern, and gauging their mental health needs in real time and at scale is difficult. To address this gap, researchers and practitioners have encouraged the use of passive technologies. Social media is one such "passive sensor" that has shown potential as a viable gauge of mental health. However, the construct validity and in-practice reliability of computational assessments of mental health constructs based on social media data remain largely unexplored. Towards this goal, we study how assessments of the mental health of college students using social media data correspond with ground-truth data from on-campus mental health consultations. For a large U.S. public university, we obtained ground-truth data on on-campus mental health consultations between 2011–2016, and collected 66,000 posts from the university’s Reddit community. We adopted machine learning and natural language methodologies to measure symptomatic mental health expressions of depression, anxiety, stress, suicidal ideation, and psychosis in the social media data. Seasonal auto-regressive integrated moving average (SARIMA) models for forecasting on-campus mental health consultations showed that incorporating social media data led to predictions with r = 0.86 and SMAPE = 13.30, outperforming models without social media data by 41%. Our language analyses revealed that social media discussions during months with many mental health consultations centered on academics and career, whereas months with few consultations saliently showed expressions of positive affect, collective identity, and socialization. This study reveals that social media data can improve our understanding of college students’ mental health, particularly their mental health treatment needs.

