Effective training data extraction method to improve influenza outbreak prediction from online news articles (Preprint)

2020 ◽  
Author(s):  
Beakcheol Jang ◽  
Inhwan Kim

BACKGROUND Each year, influenza affects 3 to 5 million people and causes 290,000 to 650,000 fatalities worldwide. To reduce the fatalities caused by influenza, several countries have established influenza surveillance systems to collect early-warning data. However, proper and timely warnings are hindered by a 1- to 2-week delay between the actual disease outbreaks and the publication of surveillance data. To avoid the delay inherent in traditional monitoring methods, novel methods have been proposed for influenza surveillance and prediction that use real-time internet data (such as search queries, microblogging, and news). Some of the currently popular approaches extract online data and use machine learning to predict influenza occurrences in a classification mode. However, many of these methods extract training data subjectively, and it is difficult to capture the latent characteristics of the data correctly. There is a critical need for new approaches that focus on extracting training data in a way that reflects the latent characteristics of the data. OBJECTIVE In this paper, we propose an effective training data extraction method that reflects these hidden features and improves performance by filtering and selecting only the keywords related to influenza before prediction. METHODS Although word embeddings provide a distributed representation of words by encoding the hidden relationships between various tokens, we enhance them by selecting keywords related to the influenza outbreak and sorting the extracted keywords by their Pearson correlation coefficient (PCC) with the influenza outbreak. The keyword extraction process is followed by a predictive model based on long short-term memory (LSTM) that predicts the influenza outbreak. To assess the performance of the proposed predictive model, we use and compare a variety of word embeddings.
RESULTS Word embeddings without our proposed sorting process achieved a prediction accuracy of 0.8705 when 50.2 keywords were selected on average. In contrast, word embeddings with our proposed sorting process achieved a prediction accuracy of 0.8868, a 12.6% improvement in prediction accuracy, even though less training data was selected, with only 20.6 keywords on average. CONCLUSIONS The sorting process empowers the embedding process, which improves feature extraction because it acts as a knowledge base for the prediction component. The model outperforms other current approaches that use flat extraction before prediction.
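The keyword-sorting step described above can be illustrated with a minimal sketch: compute the Pearson correlation between each keyword's news-frequency series and the influenza case series, then keep the top-ranked keywords as training features for the downstream LSTM. The series, keyword names, and cut-off below are illustrative assumptions, not the authors' data.

```python
import numpy as np

# Hypothetical weekly series: influenza case counts and per-keyword
# news-frequency counts (all values are illustrative only).
flu_counts = np.array([12, 18, 25, 40, 38, 22, 15, 10], dtype=float)
keyword_freqs = {
    "fever":   np.array([10, 15, 22, 35, 33, 20, 13, 9], dtype=float),
    "vaccine": np.array([5, 6, 5, 7, 6, 5, 6, 5], dtype=float),
    "cough":   np.array([8, 12, 20, 30, 29, 18, 11, 8], dtype=float),
}

def pearson(x, y):
    """Pearson correlation coefficient between two series."""
    xm, ym = x - x.mean(), y - y.mean()
    return float(xm @ ym / np.sqrt((xm @ xm) * (ym @ ym)))

# Rank keywords by correlation with the outbreak series and keep the top-k.
ranked = sorted(keyword_freqs,
                key=lambda k: pearson(keyword_freqs[k], flu_counts),
                reverse=True)
top_k = ranked[:2]
print(top_k)
```

Keywords whose frequency barely tracks the outbreak (here, "vaccine") fall to the bottom of the ranking and are filtered out before training.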

2020 ◽  
Author(s):  
Pintu Lohar ◽  
Andy Way

Building a robust MT system requires a sufficiently large parallel corpus to be available as training data. In this paper, we propose to automatically extract parallel sentences from comparable corpora without using any MT system or even any parallel corpus at all. Instead, we use cross-lingual information retrieval (CLIR), average word embeddings, text similarity and a bilingual dictionary, thus saving a significant amount of time and effort as no MT system is involved in this process. We conduct experiments on two different kinds of data: (i) formal texts from the news domain, and (ii) user-generated content (UGC) from hotel reviews. The automatically extracted sentence pairs are then added to the already available parallel training data, and the extended translation models are built from the concatenated data sets. Finally, we compare the performance of our new extended models against the baseline models built from the available data. The experimental evaluation reveals that our proposed approach is capable of improving the translation outputs for both the formal texts and the UGC.
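The core matching step, representing each sentence as the average of its word embeddings and scoring candidate target-language sentences by cosine similarity, can be sketched as follows. The toy vectors, vocabulary, and candidates are illustrative assumptions; the CLIR retrieval stage and bilingual dictionary lookup from the paper are omitted here.

```python
import numpy as np

# Toy word vectors assumed to live in a shared bilingual space (illustrative;
# in the paper a bilingual dictionary maps source words onto target words).
emb = {
    "hotel": np.array([1.0, 0.1]), "clean": np.array([0.2, 1.0]),
    "room":  np.array([0.9, 0.3]), "nice":  np.array([0.3, 0.9]),
}

def sent_vec(tokens):
    """Average word embedding of a sentence; out-of-vocabulary tokens are skipped."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

src = ["hotel", "room"]                                  # source sentence (after dictionary lookup)
cands = [["hotel", "clean", "room"], ["nice", "clean"]]  # retrieved target candidates

# Score candidates; a pair would be kept only if the score clears a threshold.
scores = [cosine(sent_vec(src), sent_vec(c)) for c in cands]
best = int(np.argmax(scores))
print(best, round(scores[best], 3))
```

In practice the highest-scoring pairs above a similarity threshold are appended to the existing parallel training data.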


Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a Sparse Selection Index (SSI) that integrates Selection Index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as the special case obtained when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in ten different environments) that the SSI can achieve significant (between 5% and 10%) gains in prediction accuracy relative to G-BLUP.
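The λ = 0 special case, plain G-BLUP, where every training individual acts as a support point, can be sketched with a toy genomic relationship matrix: predictions for the test set are a linear index of the training phenotypes. The marker data, dimensions, and variance ratio below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy marker matrix: n individuals x p SNPs, allele dosages coded 0/1/2
# (synthetic stand-ins for real genotype data).
n_trn, n_tst, p = 30, 5, 50
X = rng.integers(0, 3, size=(n_trn + n_tst, p)).astype(float)
X -= X.mean(axis=0)                  # centre allele dosages
G = X @ X.T / p                      # genomic relationship matrix

y = rng.normal(size=n_trn)           # training phenotypes (toy values)
ratio = 1.0                          # assumed sigma_e^2 / sigma_g^2

# G-BLUP: test-set genetic values are predicted as a weighted index of the
# training phenotypes; the SSI would sparsify these weights for lambda > 0.
G_trn = G[:n_trn, :n_trn]
G_tst_trn = G[n_trn:, :n_trn]
weights = G_tst_trn @ np.linalg.inv(G_trn + ratio * np.eye(n_trn))
u_hat = weights @ y
print(u_hat.shape)
```

For λ > 0 the SSI replaces the dense weight rows above with sparse ones, so each test individual draws on only its own subset of support points.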


Pharmaceutics ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 358 ◽  
Author(s):  
Chiara R. M. Brambilla ◽  
Ogochukwu Lilian Okafor-Muo ◽  
Hany Hassanin ◽  
Amr ElShaer

Three-dimensional (3D) printing is a recent technology that makes it possible to manufacture personalised dosage forms, and it has a broad range of applications. One of the most developed applications is the manufacture of oral solid dosage forms, and the four 3DP techniques most used for their manufacture are FDM, inkjet 3DP, SLA and SLS. This systematic review statistically analyses the current 3DP techniques employed in manufacturing oral solid formulations and assesses recent trends in this new technology. The work is organised into four steps: (1) screening of the articles, definition of the inclusion and exclusion criteria, and classification of the articles into the two main groups (included/excluded); (2) quantification and characterisation of the included articles; (3) evaluation of the validity of data and the data extraction process; and (4) data analysis, discussion, and conclusions to define which technique offers the best properties for the manufacture of oral solid formulations. It was observed that with the SLS 3DP technique, all the characterisation tests required by the BP (drug content, drug dissolution profile, hardness, friability, disintegration time and uniformity of weight) were performed in the majority of articles, except for the friability test. However, it is not possible to define which of the four 3DP techniques is the most suitable for the manufacture of oral solid formulations, because the selection is affected by different parameters, such as the type of formulation and the physical-mechanical properties to be achieved. Moreover, each technique has its specific advantages and disadvantages: for FDM the biggest challenge is degradation of the drug due to the high printing temperatures, while for SLA it is the toxicity and carcinogenic risk of the photopolymerising material.


2020 ◽  
Vol 13 (1) ◽  
pp. 34 ◽ 
Author(s):  
Rong Yang ◽  
Robert Wang ◽  
Yunkai Deng ◽  
Xiaoxue Jia ◽  
Heng Zhang

The random cropping data augmentation method is widely used to train convolutional neural network (CNN)-based target detectors to detect targets in optical images (e.g., the COCO dataset). It can expand the scale of the dataset dozens of times while adding only a small computational cost when training the neural network detector. In addition, random cropping can greatly enhance the spatial robustness of the model, because it makes the same target appear in different positions of the sample image. Nowadays, random cropping and random flipping have become the standard configuration for tasks with limited training data, which makes it natural to introduce them into the training of CNN-based synthetic aperture radar (SAR) image ship detectors. However, in this paper, we show that directly introducing traditional random cropping methods into the training of a CNN-based SAR image ship detector may generate considerable noise in the gradient during backpropagation, which hurts detection performance. To eliminate this noise in the training gradient, a simple and effective training method based on a feature-map mask is proposed. Experiments prove that the proposed method can effectively eliminate the gradient noise introduced by random cropping and significantly improve detection performance under a variety of evaluation indicators without increasing inference cost.
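One way to picture a feature-map mask, assuming the invalid (e.g., padded) region introduced by cropping is known, is to zero out the per-location loss, and hence its gradient, wherever the feature map falls outside valid image content. This is a minimal numpy sketch of that idea, not the authors' exact implementation, and the loss values and mask layout are illustrative.

```python
import numpy as np

# Toy per-location training loss over a detector feature map (illustrative values).
loss_map = np.array([
    [0.2, 0.1, 0.9],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
])

# Binary mask marking feature-map cells that correspond to valid image content;
# here the rightmost column is assumed to be padding introduced by random
# cropping, so its (noisy) contribution is suppressed.
valid = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
], dtype=float)

# Masked loss: only valid cells contribute to backpropagation.
masked_loss = float((loss_map * valid).sum() / valid.sum())
plain_loss = float(loss_map.mean())
print(round(masked_loss, 3), round(plain_loss, 3))
```

Because the mask only modifies the training loss, nothing changes at inference time, which matches the abstract's claim of no added inference cost.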


Metabolites ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 240 ◽ 
Author(s):  
Alison Woodward ◽  
Alina Pandele ◽  
Salah Abdelrazig ◽  
Catherine A. Ortori ◽  
Iqbal Khan ◽  
...  

The integration of untargeted metabolomics and transcriptomics from the same population of cells or tissue enhances confidence in the identified metabolic pathways and understanding of the enzyme–metabolite relationship. Here, we optimised a simultaneous extraction method for metabolites/lipids and RNA from ependymoma cells (BXD-1425). Relative to established single-extraction methods for RNA (mirVana kit) or metabolites (sequential solvent addition and shaking), four dual-extraction techniques were evaluated and compared (methanol:water:chloroform ratios): cryomill/mirVana (1:1:2); cryomill-wash/Econospin (5:1:2); rotation/phenol-chloroform (9:10:1); sequential/mirVana (1:1:3). All methods extracted the same metabolites, yet rotation/phenol-chloroform did not extract lipids. Cryomill/mirVana and sequential/mirVana recovered the highest amounts of RNA, at 70% and 68% of that recovered with the mirVana kit alone. Sequential/mirVana, involving RNA extraction from the interphase of our established sequential solvent addition and shaking metabolomics-lipidomics extraction method, was the most efficient approach overall. Sequential/mirVana was applied to study (a) the biological effect caused by acute serum starvation in BXD-1425 cells and (b) primary ependymoma tumour tissue. We found (a) 64 differentially abundant metabolites and 28 differentially expressed metabolic genes, discovering four gene-metabolite interactions, and (b) that all metabolites and 62% of lipids were above the limit of detection, and the RNA yield was sufficient for transcriptomics, in just 10 mg of tissue.


2014 ◽  
Vol 67 (2) ◽  
Author(s):  
Norzita Ngadi ◽  
Noor Yahida Yahya

Pandan (Pandanus amaryllifolius Roxb.) leaves are widely used in Malaysia as a source of natural flavouring. The major compound contributing to the characteristic flavour of Pandan is 2-acetyl-1-pyrroline (2AP). Given consumer demand for natural flavours, extraction of flavour components from natural sources has been sought. In this study, solvent extraction of 2AP from Pandan leaves was performed. The effect of the solvent used during the extraction process (i.e., methanol, ethanol, propanol) on the yield of 2AP was investigated. The presence of 2AP was determined using GC-MS. The results showed that ethanol was a better solvent than methanol for extracting 2AP from Pandan leaves, as a higher 2AP peak was observed in the ethanol chromatogram. However, no 2AP was detected when propanol was used as the solvent. It is believed that the polarity of the solvent plays an important role in the extraction of 2AP.


2016 ◽  
Vol 3 (3) ◽  
pp. 25-44 ◽  
Author(s):  
Omisore Olatunji Mumini ◽  
Fayemiwo Michael Adebisi ◽  
Ofoegbu Osita Edward ◽  
Adeniyi Shukurat Abidemi

Stock trading, which relies on predicting the direction of future stock prices, is a dynamic business primarily based on human intuition. It involves analyzing certain non-linear fundamental and technical stock variables that are recorded periodically. This study presents the development of an ANN-based prediction model for forecasting closing prices in stock markets. The major steps taken are identification of the technical variables used for prediction of stock prices, collection and pre-processing of stock data, and formulation of the ANN-based predictive model. Stock data for the period between 2010 and 2014 were collected from the Nigerian Stock Exchange (NSE) and stored in a database. The data collected were split into training and test data, where the training data were used to learn the non-linear patterns that exist in the dataset, and the test data were used to validate the prediction accuracy of the model. Evaluation results obtained from WEKA show that the discrepancies between actual and predicted values are insignificant.
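The workflow described above, technical variables in, closing price out, with a train/test split, can be sketched as a one-hidden-layer ANN trained by gradient descent. The synthetic "technical indicators", target function, and network configuration below are illustrative assumptions, not the NSE data or the WEKA setup used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for periodically recorded technical indicators
# (e.g. moving averages, momentum) and a non-linear closing-price target.
X = rng.normal(size=(200, 3))
true_w = np.array([0.5, -0.3, 0.8])
y = np.tanh(X @ true_w) + 0.05 * rng.normal(size=200)

# Split into training and test sets, as in the study.
X_trn, y_trn, X_tst, y_tst = X[:160], y[:160], X[160:], y[160:]

# One-hidden-layer network trained with full-batch gradient descent on MSE.
W1 = 0.1 * rng.normal(size=(3, 8)); b1 = np.zeros(8)
W2 = 0.1 * rng.normal(size=8);      b2 = 0.0
lr = 0.05
for _ in range(500):
    h = np.tanh(X_trn @ W1 + b1)             # hidden activations
    pred = h @ W2 + b2
    err = pred - y_trn                       # dL/dpred for (1/2)*MSE
    gW2 = h.T @ err / len(err); gb2 = err.mean()
    dh = np.outer(err, W2) * (1 - h**2)      # backprop through tanh
    gW1 = X_trn.T @ dh / len(err); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Validate on the held-out test data.
test_pred = np.tanh(X_tst @ W1 + b1) @ W2 + b2
mse = float(np.mean((test_pred - y_tst) ** 2))
print(round(mse, 4))
```

The held-out MSE, compared against the variance of the test targets, plays the role of the "discrepancy between actual and predicted values" reported from WEKA.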

