Leveraging Road Characteristics and Contributor Behaviour for Assessing Road Type Quality in OSM

2021 ◽  
Vol 10 (7) ◽  
pp. 436
Author(s):  
Amerah Alghanim ◽  
Musfira Jilani ◽  
Michela Bertolotto ◽  
Gavin McArdle

Volunteered Geographic Information (VGI) is often collected by non-expert users. This raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures which compare VGI to authoritative data sources such as National Mapping Agencies are common but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures which compare the data to heuristics or models built from the VGI data are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality where they can infer and predict the properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Specifically, using our proposed novel approach we obtained an average classification accuracy of 84.12%. This result outperforms existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is important. To address this issue we have also developed a new measure for this using direct and indirect characteristics of OSM data such as its edit history along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy within the machine learning model shows that the trusted data collected with the new approach improves the prediction accuracy of our machine learning technique. Specifically, our results demonstrate that the classification accuracy of our developed model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. Consequently, such results can be used to assess the quality of OSM and suggest improvements to the data set.

2020 ◽  
Vol 10 (2) ◽  
pp. 1-26
Author(s):  
Naghmeh Moradpoor Sheykhkanloo ◽  
Adam Hall

An insider threat can take many forms and fall under different categories, including the malicious insider, the careless, unaware, uneducated, or naïve employee, and the third-party contractor. Machine learning techniques have been studied in published literature as a promising solution for such threats. However, they can be biased and/or inaccurate when the associated dataset is hugely imbalanced. Therefore, this article addresses insider threat detection on an extremely imbalanced dataset, which includes employing a popular balancing technique known as spread subsample. The results show that although balancing the dataset using this technique did not improve performance metrics, it did improve the time taken to build the model and the time taken to test the model. Additionally, the authors realised that running the chosen classifiers with parameters other than the default ones has an impact on both balanced and imbalanced scenarios, but the impact is significantly stronger when using the imbalanced dataset.
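The abstract above names spread subsample but does not describe its implementation. The following is a minimal sketch of the general idea, assuming it means random undersampling so that no class exceeds a chosen multiple of the rarest class. The function name `spread_subsample`, the parameter `max_spread`, and the toy data are hypothetical and are not taken from the article.

```python
import random
from collections import defaultdict


def spread_subsample(rows, labels, max_spread=1.0, seed=0):
    """Randomly undersample so no class exceeds max_spread x the rarest class size.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # The rarest class sets the cap for every other class.
    smallest = min(len(members) for members in by_class.values())
    cap = int(smallest * max_spread)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Classes already within the cap are kept whole.
        kept = members if len(members) <= cap else rng.sample(members, cap)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


# Toy, made-up data: 95 "normal" events and 5 "insider" events.
rows = list(range(100))
labels = ["normal"] * 95 + ["insider"] * 5
balanced_rows, balanced_labels = spread_subsample(rows, labels, max_spread=1.0)
```

With `max_spread=1.0` every class is reduced to the size of the rarest one. A smaller training set is one plausible reason for the shorter model build and test times reported above.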


Missing data raise major issues for quantitative analysis in large databases. Because of these issues, inference from the computational process can produce biased results, further damage the data, and increase the error rate, and the imputation process becomes more difficult to accomplish. Predicting disguised missing data in large data sets is another major problem in real-time operation. Machine learning (ML) techniques are combined with classification of the measurements to improve the accuracy of the predicted values. These techniques overcome various challenges posed by the problem of lost data. Recent work has been based on predicting misclassification using a supervised ML approach, which predicts an output for an unseen input with a limited number of parameters in a data set. When the number of parameters increases, the accuracy of the outcome decreases. This article presents a new approach, COBACO, an effective supervised machine learning technique. Several strategies describe the classification of predictive techniques for missing data analysis in efficient supervised machine learning techniques. The proposed predictive technique, COBACO, generated more precise and accurate results than the other predictive approaches. Experimental results obtained using both real and synthetic data sets show that the proposed approach offers valuable and promising insight into the problem of predicting missing information.


2019 ◽  
pp. 469-487
Author(s):  
Musfira Jilani ◽  
Michela Bertolotto ◽  
Padraig Corcoran ◽  
Amerah Alghanim

Nowadays an ever-increasing number of applications require complete and up-to-date spatial data, in particular maps. However, mapping is an expensive process and the vastness and dynamics of our world usually render centralized and authoritative maps outdated and incomplete. In this context crowd-sourced maps have the potential to provide a complete, up-to-date, and free representation of our world. However, the proliferation of such maps largely remains limited due to concerns about their data quality. While most of the current data quality assessment mechanisms for such maps require referencing to authoritative maps, we argue that such referencing of a crowd-sourced spatial database is ineffective. Instead we focus on the use of machine learning techniques that we believe have the potential to not only allow the assessment but also to recommend the improvement of the quality of crowd-sourced maps without referencing to external databases. This chapter gives an overview of these approaches.


2020 ◽  
Author(s):  
Cecilia Contreras ◽  
Mahdi Khodadadzadeh ◽  
Laura Tusa ◽  
Richard Gloaguen

Drilling is a key task in exploration campaigns to characterize mineral deposits at depth. Drill-cores are first logged in the field by a geologist with regard to, e.g., mineral assemblages, alteration patterns, and structural features. The core-logging information is then used to locate and target the important ore accumulations and select representative samples that are further analyzed by laboratory measurements (e.g., Scanning Electron Microscopy (SEM), X-ray diffraction (XRD), X-ray Fluorescence (XRF)). However, core-logging is a laborious task and subject to the expertise of the geologist. Hyperspectral imaging is a non-invasive and non-destructive technique that is increasingly being used to support the geologist in the analysis of drill-core samples. Nonetheless, the benefit and impact of using hyperspectral data depend on the applied methods. With this in mind, machine learning techniques, which have been applied in different research fields, provide useful tools for an advanced and more automatic analysis of the data. Lately, machine learning frameworks are also being implemented for mapping minerals in drill-core hyperspectral data. In this context, this work follows an approach to map minerals on drill-core hyperspectral data using supervised machine learning techniques, in which SEM data, integrated with the mineral liberation analysis (MLA) software, are used in training a classifier. More specifically, the high-resolution mineralogical data obtained by SEM-MLA analysis is resampled and co-registered to the hyperspectral data to generate a training set. Due to the large difference in spatial resolution between the SEM-MLA and hyperspectral images, a pre-labeling strategy is required to link these two images at the hyperspectral data spatial resolution. In this study, we use the SEM-MLA image to compute the abundances of minerals for each hyperspectral pixel in the corresponding SEM-MLA region. We then use the abundances as features in a clustering procedure to generate the training labels. In the final step, the generated training set is fed into a supervised classification technique for the mineral mapping over a large area of a drill-core. The experiments are carried out on a visible to near-infrared (VNIR) and shortwave infrared (SWIR) hyperspectral data set and, based on preliminary tests, the mineral mapping task improves significantly.
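The abundance step described above can be sketched as follows. This is an illustration only, assuming the co-registered SEM-MLA image is a grid of mineral labels and that each hyperspectral pixel covers a square block of SEM-MLA pixels. The function name `block_abundances`, the `block` parameter, and the toy mineral map are hypothetical, and the clustering that the authors apply afterwards is not shown.

```python
from collections import Counter


def block_abundances(mineral_map, block):
    """Return, for each block x block window of a label grid, the fraction of each label.

    Each window stands in for one coarse (hyperspectral) pixel. Rows and
    columns that do not fill a whole window are ignored.
    """
    n_rows, n_cols = len(mineral_map), len(mineral_map[0])
    total = block * block
    result = []
    for r0 in range(0, n_rows - block + 1, block):
        row_out = []
        for c0 in range(0, n_cols - block + 1, block):
            counts = Counter(
                mineral_map[r][c]
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
            )
            row_out.append({mineral: n / total for mineral, n in counts.items()})
        result.append(row_out)
    return result


# Toy, made-up high-resolution mineral map (2 x 4 labels).
sem_mla = [
    ["quartz", "quartz", "pyrite", "pyrite"],
    ["quartz", "chlorite", "pyrite", "pyrite"],
]
abundances = block_abundances(sem_mla, block=2)
```

Each resulting dictionary is an abundance vector for one coarse pixel. In the workflow described above, such vectors would serve as the features for the clustering that produces the training labels.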


Data science in healthcare is an innovative and capable field for an industry implementing data science applications. Data analytics is a recent science used to explore medical data sets and discover disease. This is an early attempt to identify disease with the help of a large amount of medical data. Using this data science methodology, users can identify their disease without the help of health care centres. Healthcare and data science are often linked through finances, as the industry attempts to reduce its expenses with the help of large amounts of data. Data science and medicine are developing rapidly, and it is important that they advance together. Health care information is very valuable to society. Heart disease is increasing in day-to-day human life. Different factors in the human body are monitored to analyse and prevent heart disease. Classifying these factors using machine learning algorithms and predicting the disease is the major part of the work, which involves supervised learning algorithms such as SVM, Naive Bayes, Decision Trees, and Random Forest.


Author(s):  
Marko Pregeljc ◽  
Erik Štrumbelj ◽  
Miran Mihelcic ◽  
Igor Kononenko

In this chapter, the authors employ traditional and novel machine learning to improve insight into the connections between the quality of organization of enterprises, as a type of formal social unit, and the results of the enterprises’ performance. The analyzed data set contains 72 Slovenian enterprises’ economic results across four years and indicators of their organizational quality. The authors hypothesize that a causal relationship exists between the latter and the former. In the first part of a two-part process, they use several classification algorithms to study these relationships and to evaluate how accurately they predict the target economic results. However, the most successful models were often very complex and difficult to interpret, especially for non-technical users. Therefore, in the second part, the authors take advantage of a novel general explanation method that can be used to explain the influence of individual features on the model’s prediction. Results show that traditional machine-learning approaches are successful at modeling the dependency relationship. Furthermore, the explanation of the influence of the input features on the predicted economic results provides insights that have a meaningful economic interpretation.


2017 ◽  
Author(s):  
Atilla Özgür ◽  
Hamit Erdem

This study investigates the effects of using a large data set on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect, 12 machine learning algorithms have been applied. These algorithms are: (1) Adaboost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5) Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9) Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark dataset, KDD99, has been used to train and test the classifiers. The full KDD99 training data set has 4.9 million instances, while the full test data set has 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% for training and 1.2%–100% for testing, this study uses the full training data set and the full test data set. The Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of the classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1-Rate. To show the effects of dataset size, the performance of the classifiers has also been evaluated using the following hardware metrics: Training Time, Working Memory, and Model Size. Test results show improvements in the classifiers on standard performance metrics compared to previous studies.
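Most of the binary performance metrics listed above follow directly from the four confusion-matrix counts. The sketch below uses their standard textbook definitions. The function name `binary_metrics` and the counts are made up for illustration and are not results from the study. Detection Rate is left out because the abstract does not define it.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    tpr = _ratio(tp, tp + fn)        # true positive rate (recall)
    tnr = _ratio(tn, tn + fp)        # true negative rate (specificity)
    fpr = _ratio(fp, fp + tn)        # false positive rate
    fnr = _ratio(fn, fn + tp)        # false negative rate
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * tpr, precision + tpr)
    return {
        "tpr": tpr,
        "tnr": tnr,
        "fpr": fpr,
        "fnr": fnr,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts: 100 attack records and 100 normal records.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

By these definitions the true and false positive rates are computed over different classes, so TPR + FNR = 1 and TNR + FPR = 1.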

