Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing

Genome Biology and Evolution ◽

10.1093/gbe/evab197 ◽

2021 ◽

Author(s):

Ahmed M Moustafa ◽

Paul J Planet

Keyword(s):

Machine Learning ◽

Vaccine Uptake ◽

Supervised Machine Learning ◽

Whole Genome Sequence ◽

Viral Diversity ◽

Whole Genome ◽

Travel Restrictions ◽

Hill Numbers ◽

Classification Tool ◽

The Impact

Abstract Background Discrete classification of SARS-CoV-2 viral genotypes can identify emerging strains and detect geographic spread, viral diversity, and transmission events. Methods We developed a tool (GNUVID) that integrates whole genome multilocus sequence typing and a supervised machine learning random forest-based classifier. We used GNUVID to assign sequence type (ST) profiles to all high-quality genomes available from GISAID. STs were clustered into clonal complexes (CCs), and then used to train a machine learning classifier. We used this tool to detect potential introduction and exportation events, and to estimate effective viral diversity across locations and over time in 16 US states. Results GNUVID is a highly scalable tool for viral genotype classification (https://github.com/ahmedmagds/GNUVID) that can quickly classify hundreds of thousands of genomes in a way that is consistent with phylogeny. Our genotyping ST/CC analysis uncovered dynamic local changes in ST/CC prevalence and diversity with multiple replacement events in different states, and an average of 20.6 putative introductions and 7.5 exportations for each state over the time period analyzed. We introduce the use of effective diversity metrics (Hill numbers) that can be used to estimate the impact of interventions (e.g., travel restrictions, vaccine uptake, mask mandates) on the variation in circulating viruses. Conclusions Our classification tool uncovered multiple introduction and exportation events, as well as waves of expansion and replacement of SARS-CoV-2 genotypes in different states. GNUVID classification lends itself to measures of ecological diversity, and, with systematic genomic sampling, it could be used to track circulating viral diversity and identify emerging clones and hotspots.
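The effective diversity metrics mentioned in the abstract (Hill numbers) can be illustrated with a short sketch. This is not code from GNUVID; the function name `hill_number` and the clonal-complex labels are hypothetical, and the formula used is the standard definition D_q = (sum of p_i^q)^(1/(1-q)), with the q = 1 case taken as the exponential of Shannon entropy.

```python
import math
from collections import Counter


def hill_number(counts, q):
    """Effective number of types of order q for a list of abundance counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if math.isclose(q, 1.0):
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical clonal-complex labels for genomes sampled in one state and week.
labels = ["CC4", "CC4", "CC4", "CC258", "CC258", "CC300"]
counts = list(Counter(labels).values())

richness = hill_number(counts, 0)     # q = 0: number of distinct types
shannon_eff = hill_number(counts, 1)  # q = 1: exp(Shannon entropy)
simpson_eff = hill_number(counts, 2)  # q = 2: inverse Simpson concentration
```

Higher orders of q weight common genotypes more heavily, so comparing q = 0, 1, and 2 over time would show whether a change in diversity is driven by rare or dominant types.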

Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing

10.1101/2020.12.28.424582 ◽

2020 ◽

Author(s):

Ahmed M. Moustafa ◽

Paul J. Planet

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Whole Genome Sequence ◽

Viral Diversity ◽

Whole Genome ◽

Learning Classifier ◽

Travel Restrictions ◽

Multiple Introduction ◽

Classification Tool

Abstract Background Discrete classification of SARS-CoV-2 viral genotypes can identify emerging strains and detect geographic spread, viral diversity, and transmission events. Methods We developed a tool (GNUVID) that integrates whole genome multilocus sequence typing and a supervised machine learning random forest-based classifier. We used GNUVID to assign sequence type (ST) profiles to each of 69,686 SARS-CoV-2 complete, high-quality genomes available from GISAID as of October 20th 2020. STs were then clustered into clonal complexes (CCs), and then used to train a machine learning classifier. We used this tool to detect potential introduction and exportation events, and to estimate effective viral diversity across locations and over time in 16 US states. Results GNUVID is a scalable tool for viral genotype classification (available at https://github.com/ahmedmagds/GNUVID) that can be used to quickly process tens of thousands of genomes. Our genotyping ST/CC analysis uncovered dynamic local changes in ST/CC prevalence and diversity with multiple replacement events in different states. We detected an average of 20.6 putative introductions and 7.5 exportations for each state. Effective viral diversity dropped in all states as shelter-in-place travel restrictions went into effect and increased as restrictions were lifted. Interestingly, our analysis showed correlation between effective diversity and the date that state-wide mask mandates were imposed. Conclusions Our classification tool uncovered multiple introduction and exportation events, as well as waves of expansion and replacement of SARS-CoV-2 genotypes in different states. Combined with future genomic sampling, the GNUVID system could be used to track circulating viral diversity and identify emerging clones and hotspots.
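The sequence typing step described in the abstract can be sketched in general terms: each distinct sequence at a locus receives an allele number, and each distinct combination of allele numbers (the profile) receives a sequence type (ST). The sketch below is a toy illustration of that idea only; the function name, the data layout, and the numbering scheme are assumptions, and it does not reproduce GNUVID's implementation, its clonal-complex clustering, or its random forest classifier.

```python
def assign_sequence_types(genomes):
    """Assign allele numbers per locus and an ST per unique allele profile.

    genomes: dict mapping a genome name to a list of locus sequences,
    with loci given in the same order for every genome (an assumption).
    """
    allele_ids = {}   # (locus index, sequence) -> allele number at that locus
    next_allele = {}  # locus index -> last allele number handed out
    st_ids = {}       # allele profile (tuple) -> ST number
    assignments = {}
    for name, loci in genomes.items():
        profile = []
        for locus, seq in enumerate(loci):
            key = (locus, seq.upper())
            if key not in allele_ids:
                next_allele[locus] = next_allele.get(locus, 0) + 1
                allele_ids[key] = next_allele[locus]
            profile.append(allele_ids[key])
        profile = tuple(profile)
        if profile not in st_ids:
            st_ids[profile] = len(st_ids) + 1
        assignments[name] = (st_ids[profile], profile)
    return assignments


# Toy genomes with two loci each (hypothetical sequences).
toy = {
    "g1": ["ATG", "CCA"],
    "g2": ["ATG", "CCA"],
    "g3": ["ATG", "CCT"],
}
result = assign_sequence_types(toy)
```

Genomes `g1` and `g2` share every allele and therefore the same ST, while `g3` differs at one locus and receives a new ST.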

Leveraging Road Characteristics and Contributor Behaviour for Assessing Road Type Quality in OSM

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10070436 ◽

2021 ◽

Vol 10 (7) ◽

pp. 436

Author(s):

Amerah Alghanim ◽

Musfira Jilani ◽

Michela Bertolotto ◽

Gavin McArdle

Keyword(s):

Machine Learning ◽

Spatial Data ◽

Classification Accuracy ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Data Set ◽

Semantic Inference ◽

Road Type ◽

The Impact

Volunteered Geographic Information (VGI) is often collected by non-expert users. This raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures which compare VGI to authoritative data sources such as National Mapping Agencies are common but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures which compare the data to heuristics or models built from the VGI data are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality where they can infer and predict the properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Specifically, using our proposed novel approach we obtained an average classification accuracy of 84.12%. This result outperforms existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is important. To address this issue we have also developed a new measure for this using direct and indirect characteristics of OSM data such as its edit history along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy within the machine learning model shows that the trusted data collected with the new approach improves the prediction accuracy of our machine learning technique. Specifically, our results demonstrate that the classification accuracy of our developed model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. Consequently, such results can be used to assess the quality of OSM and suggest improvements to the data set.

Correction: The path to international medals: A supervised machine learning approach to explore the impact of coach-led sport-specific and non-specific practice

PLoS ONE ◽

10.1371/journal.pone.0244509 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0244509

Author(s):

Michael Barth ◽

Arne Güllich ◽

Christian Raschner ◽

Eike Emrich

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Learning Approach ◽

Machine Learning Approach ◽

Specific Practice ◽

The Impact

Teleconsultations between Patients and Healthcare Professionals in Primary Care in Catalonia: the Evaluation of Text Classification Algorithms Using Machine Learning

10.20944/preprints201912.0220.v1 ◽

2019 ◽

Author(s):

Francesc López Seguí ◽

Ricardo Ander Egg Aguilar ◽

Gabriel de Maeztu ◽

Anna García-Altés ◽

Francesc García Cuyàs ◽

...

Keyword(s):

Machine Learning ◽

Primary Care ◽

Text Classification ◽

Learning Strategy ◽

Care Service ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Face To Face ◽

Classification Tool ◽

The Impact

Background: the primary care service in Catalonia has operated an asynchronous teleconsulting service between GPs and patients since 2015 (eConsulta), which has generated some 500,000 messages. New developments in big data analysis tools, particularly those involving natural language, can be used to accurately and systematically evaluate the impact of the service. Objective: the study was intended to examine the predictive potential of eConsulta messages through different combinations of vector representation of text and machine learning algorithms and to evaluate their performance. Methodology: 20 machine learning algorithms (based on 5 types of algorithms and 4 text representation techniques) were trained using a sample of 3,559 messages (169,102 words) corresponding to 2,268 teleconsultations (1.57 messages per teleconsultation) in order to predict the three variables of interest (avoiding the need for a face-to-face visit, increased demand and type of use of the teleconsultation). The performance of the various combinations was measured in terms of precision, sensitivity, F-value and the ROC curve. Results: the best-trained algorithms are generally effective, proving themselves to be more robust when approximating the two binary variables "avoiding the need of a face-to-face visit" and "increased demand" (precision = 0.98 and 0.97, respectively) rather than the variable "type of query" (precision = 0.48). Conclusion: to the best of our knowledge, this study is the first to investigate a machine learning strategy for text classification using primary care teleconsultation datasets. The study illustrates the possible capacities of text analysis using artificial intelligence. The development of a robust text classification tool could be feasible by validating it with more data, making it potentially more useful for decision support for health professionals.
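The performance measures named in the abstract (precision, sensitivity, and F-value) can be computed for a binary variable as in the sketch below. The function name and the toy labels are hypothetical and are not taken from the study's data; the F-value is assumed here to be the F1 score, the harmonic mean of precision and sensitivity.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, sensitivity, F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1


# Toy labels: 1 = message avoided a face-to-face visit (hypothetical data).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
precision, sensitivity, f1 = binary_metrics(y_true, y_pred)
```

Reporting precision alone, as in the results quoted above, says how often a positive prediction is correct; sensitivity adds how many true positives were found.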

Predicting the Mechanical Properties of RCA-Based Concrete Using Supervised Machine Learning Algorithms

Materials ◽

10.3390/ma15020647 ◽

2022 ◽

Vol 15 (2) ◽

pp. 647

Author(s):

Meijun Shang ◽

Hejun Li ◽

Ayaz Ahmad ◽

Waqas Ahmad ◽

Krzysztof Adam Ostrowski ◽

...

Keyword(s):

Machine Learning ◽

Mechanical Properties ◽

Mean Square Error ◽

Coarse Aggregate ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Environmental Damage ◽

Fine Aggregate ◽

Mean Square ◽

The Impact

Environment-friendly concrete is gaining popularity these days because it consumes less energy and causes less damage to the environment. Rapid increases in the population and demand for construction throughout the world lead to a significant deterioration or reduction in natural resources. Meanwhile, construction waste continues to grow at a high rate as older buildings are destroyed and demolished. As a result, the use of recycled materials may contribute to improving the quality of life and preventing environmental damage. Additionally, the application of recycled coarse aggregate (RCA) in concrete is essential for minimizing environmental issues. The compressive strength (CS) and splitting tensile strength (STS) of concrete containing RCA are predicted in this article using decision tree (DT) and AdaBoost machine learning (ML) techniques. A total of 344 data points with nine input variables (water, cement, fine aggregate, natural coarse aggregate, RCA, superplasticizers, water absorption of RCA, maximum size of RCA, and density of RCA) were used to run the models. The data was validated using k-fold cross-validation and the coefficient of determination (R2), mean square error (MSE), mean absolute error (MAE), and root mean square error (RMSE) values. However, the model’s performance was assessed using statistical checks. Additionally, sensitivity analysis was used to determine the impact of each variable on the forecasting of mechanical properties.
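The error measures listed in the abstract can be written out as a small sketch. The function name and the strength values below are hypothetical, and R2 is computed here as the coefficient of determination, 1 minus the ratio of residual to total sum of squares, which is an assumption about how the article defines it.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return a dict with MSE, MAE, RMSE and R2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": mse,
        "mae": mae,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,  # undefined if all y_true are equal
    }


# Hypothetical compressive strengths in MPa: observed vs. predicted.
observed = [30.0, 40.0, 50.0]
predicted = [32.0, 38.0, 50.0]
metrics = regression_metrics(observed, predicted)
```

In a k-fold setting these values would be computed on each held-out fold and then averaged.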

Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced Dataset

International Journal of Cyber Warfare and Terrorism ◽

10.4018/ijcwt.2020040101 ◽

2020 ◽

Vol 10 (2) ◽

pp. 1-26

Author(s):

Naghmeh Moradpoor Sheykhkanloo ◽

Adam Hall

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Machine Learning Algorithms ◽

Third Party ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Insider Threat ◽

Threat Detection ◽

Imbalanced Dataset ◽

The Impact

An insider threat can take on many forms and fall under different categories. These include the malicious insider, the careless/unaware/uneducated/naïve employee, and the third-party contractor. Machine learning techniques have been studied in published literature as a promising solution for such threats. However, they can be biased and/or inaccurate when the associated dataset is hugely imbalanced. Therefore, this article addresses insider threat detection on an extremely imbalanced dataset, which includes employing a popular balancing technique known as spread subsample. The results show that although balancing the dataset using this technique did not improve performance metrics, it did improve the time taken to build the model and the time taken to test the model. Additionally, the authors realised that running the chosen classifiers with parameters other than the default ones has an impact on both balanced and imbalanced scenarios, but the impact is significantly stronger when using the imbalanced dataset.
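The balancing step can be illustrated with a rough sketch of random undersampling. It assumes that spread subsampling caps the ratio between the most common and the rarest class at a chosen maximum spread; the function name, parameters, and toy labels are hypothetical, and this is not the implementation used in the article.

```python
import random
from collections import Counter, defaultdict


def spread_subsample(rows, labels, max_spread=1.0, seed=0):
    """Undersample so no class exceeds max_spread times the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    rarest = min(len(members) for members in by_class.values())
    cap = max(rarest, int(rarest * max_spread))
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Keep small classes whole; randomly subsample classes above the cap.
        kept = members if len(members) <= cap else rng.sample(members, cap)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


# Toy imbalanced data: 95 normal events and 5 insider events (hypothetical).
rows = list(range(100))
labels = ["normal"] * 95 + ["insider"] * 5
balanced_rows, balanced_labels = spread_subsample(rows, labels, max_spread=1.0)
class_counts = Counter(balanced_labels)
```

A smaller training set after subsampling is consistent with the reported reduction in model build and test time.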

Sensitivity of Human Papillomavirus (HPV) Lineage and Sublineage Variant Pseudoviruses to Neutralization by Nonavalent Vaccine Antibodies

The Journal of Infectious Diseases ◽

10.1093/infdis/jiz401 ◽

2019 ◽

Vol 220 (12) ◽

pp. 1940-1945 ◽

Cited By ~ 2

Author(s):

Anna Godi ◽

Troy J Kemp ◽

Ligia A Pinto ◽

Simon Beddows

Keyword(s):

Human Papillomavirus ◽

Genome Sequence ◽

Protein Function ◽

Whole Genome Sequence ◽

Whole Genome ◽

Neutralization Sensitivity ◽

Natural Variants ◽

The Impact ◽

Induced Immunity ◽

Low Magnitude

Abstract Natural variants of human papillomavirus (HPV) are classified into lineages and sublineages based upon whole-genome sequence, but the impact of diversity on protein function is unclear. We investigated the susceptibility of 3–8 representative pseudovirus variants of HPV16, HPV18, HPV31, HPV33, HPV45, HPV52, and HPV58 to neutralization by nonavalent vaccine (Gardasil®9) sera. Many variants demonstrated significant differences in neutralization sensitivity from their consensus A/A1 variant but these were of a low magnitude. HPV52 D and HPV58 C variants exhibited >4-fold reduced sensitivities compared to their consensus A/A1 variant and should be considered distinct serotypes with respect to nonavalent vaccine-induced immunity.

Sequencing era methods for identifying signatures of selection in the genome

Briefings in Bioinformatics ◽

10.1093/bib/bby064 ◽

2018 ◽

Vol 20 (6) ◽

pp. 1997-2008 ◽

Cited By ~ 3

Author(s):

Clare Horscroft ◽

Sarah Ennis ◽

Reuben J Pengelly ◽

Timothy J Sluckin ◽

Andrew Collins

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Large Data ◽

Substantial Improvement ◽

Data Availability ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data ◽

Signatures Of Selection ◽

The Impact

Abstract Insights into genetic loci which are under selection and their functional roles contribute to increased understanding of the patterns of phenotypic variation we observe today. The availability of whole-genome sequence data, for humans and other species, provides opportunities to investigate adaptation and evolution at unprecedented resolution. Many analytical methods have been developed to interrogate these large data sets and characterize signatures of selection in the genome. We review here recently developed methods and consider the impact of increased computing power and data availability on the detection of selection signatures. Consideration of demography, recombination and other confounding factors is important, and use of a range of methods in combination is a powerful route to resolving different forms of selection in genome sequence data. Overall, a substantial improvement in methods for application to whole-genome sequencing is evident, although further work is required to develop robust and computationally efficient approaches which may increase reproducibility across studies.

Tiling the genome into consistently named subsequences enables precision medicine and machine learning with millions of complex individual data-sets

10.7287/peerj.preprints.1426 ◽

2015 ◽

Author(s):

Sarah Guthrie ◽

Abram Connelly ◽

Peter Amstutz ◽

Adam F. Berrey ◽

Nicolas Cesar ◽

...

Keyword(s):

Machine Learning ◽

Precision Medicine ◽

Blood Type ◽

Whole Genome Sequence ◽

Medical Community ◽

Support Vector ◽

Personal Genome ◽

Whole Genome ◽

Genome Sequences ◽

Global Alliance

The scientific and medical community is reaching an era of inexpensive whole genome sequencing, opening the possibility of precision medicine for millions of individuals. Here we present tiling: a flexible representation of whole genome sequences that supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We partitioned the genome into 10,655,006 tiles: overlapping, variable-length sequences that begin and end with unique 24-base tags. We tiled and annotated 680 public whole genome sequences from the 1000 Genomes Project Consortium (1KG) and Harvard Personal Genome Project (PGP) using ClinVar database information. These genomes cover 14.13 billion tile sequences (4.087 trillion high quality bases and 0.4321 trillion low quality bases) and 251 phenotypes spanning ICD-9 code ranges 140-289, 320-629, and 680-759. We used these data to build a Global Alliance for Genomics and Health Beacon and graph database. We performed principal component analysis (PCA) on the 680 public whole genomes, and by projecting the tiled genomes onto their first two principal components, we replicated the 1KG principal component separation by population ethnicity codes. Interestingly, we found the PGP self-reported ethnicities cluster consistently with 1KG ethnicity codes. We built a set of support-vector ABO blood-type classifiers using 75 PGP participants who had both a whole genome sequence and a self-reported blood type. Our classifier predicts A antigen presence to within 1% of the current state-of-the-art for in silico A antigen prediction. Finally, we found six PGP participants with previously undiscovered pathogenic BRCA variants, and using our tiling, gave them simple, consistent names, which can be easily and independently re-derived. Given the near-future requirements of genomics research and precision medicine, we propose the adoption of tiling and invite all interested individuals and groups to view, rerun, copy, and modify these analyses at https://curover.se/su92l-j7d0g-swtofxa2rct8495
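The tiling idea, overlapping tiles that begin and end with unique tags, can be sketched on a toy sequence. The sketch uses 4-base tags instead of the 24-base tags described above, and the function name, the tag list, and the sequence are hypothetical; how the authors choose tags or handle variants is not covered.

```python
def tile_sequence(sequence, tags):
    """Split a sequence into overlapping tiles bounded by unique tags.

    Each tile runs from the start of one tag to the end of the next tag,
    so neighbouring tiles share exactly one tag.
    """
    positions = []
    for tag in tags:
        first = sequence.find(tag)
        # A tag must occur exactly once, otherwise it cannot anchor a tile.
        if first == -1 or sequence.find(tag, first + 1) != -1:
            raise ValueError(f"tag {tag!r} is not unique in the sequence")
        positions.append(first)
    if positions != sorted(positions):
        raise ValueError("tags must be listed in sequence order")
    tiles = []
    for i in range(len(tags) - 1):
        start = positions[i]
        end = positions[i + 1] + len(tags[i + 1])
        tiles.append(sequence[start:end])
    return tiles


# Toy example with 4-base tags (the abstract describes 24-base tags).
tags = ["AACG", "GGTA", "CTTC"]
sequence = "AACG" + "TTT" + "GGTA" + "CACAC" + "CTTC"
tiles = tile_sequence(sequence, tags)
```

Because each tile is identified by its bounding tags, the same stretch of genome receives the same position name in every individual, even when the bases between the tags differ.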
