Machine Learning Data Imputation and Prediction of Foraging Group Size in a Kleptoparasitic Spider

Mathematics ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 415
Author(s):  
Yong-Chao Su ◽  
Cheng-Yu Wu ◽  
Cheng-Hong Yang ◽  
Bo-Sheng Li ◽  
Sin-Hua Moi ◽  
...  

Cost–benefit analysis is widely used to elucidate the association between foraging group size and resource size. Despite advances in the development of theoretical frameworks, however, the empirical systems used for testing are hindered by the vagaries of field surveys and incomplete data. This study developed three approaches to data imputation based on machine learning (ML) algorithms with the aim of rescuing valuable field data. Using 163 host spider webs (132 complete records and 31 incomplete records), our results indicated that imputation based on the random forest algorithm outperformed classification and regression trees, k-nearest neighbor, and other conventional approaches (Wilcoxon signed-rank tests and correlation differences: p-values from <0.001 to 0.030). We then used the rescued data from a natural system of kleptoparasitic spiders (Argyrodes miniaceus, Theridiidae) in Taiwan and Vietnam to examine the occurrence and group size of kleptoparasites in natural populations. Our partial least-squares path modelling (PLS-PM) results demonstrated that host web size (T = 6.890, p < 0.001) is a significant feature affecting group size, while resource size (T = 2.590, p = 0.010) and microclimate (T = 3.230, p = 0.001) are significant features affecting the presence of kleptoparasites. Testing the conformity of the group size distribution to the ideal free distribution (IFD) model revealed that predictions pertaining to per-capita resource size were underestimated (bootstrap resampling mean slopes < IFD-predicted slopes, p < 0.001). These findings highlight the importance of applying appropriate ML methods to the handling of missing field data.
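One of the baselines the study compares, k-nearest-neighbor imputation, can be sketched in a few lines of NumPy: each missing entry is filled with the mean of that feature over the k complete rows closest in the columns the incomplete row has observed. The function name and toy data below are illustrative, not the authors' code.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in X: for each incomplete row, find the k complete rows
    nearest in the columns that row has observed, and use their mean."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]  # rows with no missing values
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # distance to complete rows, using only this row's observed columns
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        filled[i, miss] = nearest[:, miss].mean(axis=0)
    return filled

# toy field table: web size, resource size, group size (NaN = not recorded)
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, 3.1],
              [0.9, 1.9, 2.9],
              [1.0, 2.0, np.nan]])
X_hat = knn_impute(X, k=3)
```

A random-forest imputer, the approach the study found best, replaces the neighbor average with per-feature regression-forest predictions but follows the same fill-the-gaps pattern.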

10.2196/30824 ◽  
2021 ◽  
Vol 7 (10) ◽  
pp. e30824
Author(s):  
Hansle Gwon ◽  
Imjin Ahn ◽  
Yunha Kim ◽  
Hee Jun Kang ◽  
Hyeram Seo ◽  
...  

Background: When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute missing values include statistical methods, such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods, such as the multilayer perceptron, k-nearest neighbor, and decision tree. Objective: The objective of this study was to impute numeric medical data, such as physical and laboratory data. We aimed to impute data effectively using a progressive method called self-training in the medical field, where training data are scarce. Methods: We propose a self-training method that gradually increases the available data. Models trained on complete data predict the missing values in incomplete data. Among the incomplete data, the records whose missing values are validly predicted are incorporated into the complete data; using a predicted value as if it were the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of the process is how to evaluate the accuracy of the pseudolabels, which can be assessed by observing the effect of the pseudolabeled data on model performance. Results: In self-training using random forest (RF), the mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was statistically confirmed: in the Friedman test against MICE and RF, self-training showed a P value between .003 and .02, and a Wilcoxon signed-rank test against mean imputation showed the lowest attainable P value, 3.05e-5, in all situations. Conclusions: Self-training showed significant results when comparing predicted and actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
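The self-training loop described in the Methods can be sketched as follows. The regressor (ordinary least squares via NumPy) and the acceptance rule (two half-models must agree on a prediction) are illustrative stand-ins only; how to validate pseudolabels is the paper's own open question.

```python
import numpy as np

def fit_ols(X, y):
    # least-squares fit with intercept
    A = np.c_[np.ones(len(X)), X]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.c_[np.ones(len(X)), X] @ coef

def self_train(X_lab, y_lab, X_unlab, tol=0.1, max_iter=10):
    """Iteratively pseudolabel unlabeled rows whose predictions two
    half-models agree on (illustrative acceptance rule)."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    pending = X_unlab.copy()
    for _ in range(max_iter):
        if len(pending) == 0:
            break
        rng = np.random.default_rng(0)
        idx = rng.permutation(len(X_lab))
        half = len(idx) // 2
        c1 = fit_ols(X_lab[idx[:half]], y_lab[idx[:half]])
        c2 = fit_ols(X_lab[idx[half:]], y_lab[idx[half:]])
        p1, p2 = predict(c1, pending), predict(c2, pending)
        ok = np.abs(p1 - p2) < tol  # accept pseudolabels the halves agree on
        if not ok.any():
            break
        X_lab = np.vstack([X_lab, pending[ok]])
        y_lab = np.r_[y_lab, (p1[ok] + p2[ok]) / 2]  # pseudolabel = mean prediction
        pending = pending[~ok]
    return X_lab, y_lab

X_lab = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y_lab = 2 * X_lab[:, 0] + 1
X_unlab = np.array([[6.], [7.]])
X_all, y_all = self_train(X_lab, y_lab, X_unlab)
```

Each pass grows the labeled pool, so later models train on more data, which is the "gradually increases the available data" idea in the abstract.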


Genes ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 1661
Author(s):  
Erik D. Huckvale ◽  
Matthew W. Hodgman ◽  
Brianna B. Greenwood ◽  
Devorah O. Stucki ◽  
Katrisa M. Ward ◽  
...  

The Alzheimer’s Disease Neuroimaging Initiative (ADNI) contains extensive patient measurements (e.g., magnetic resonance imaging [MRI], biometrics, RNA expression) from Alzheimer’s disease (AD) cases and controls that have recently been used by machine learning algorithms to evaluate AD onset and progression. While using a variety of biomarkers is essential to AD research, highly correlated input features can significantly decrease machine learning model generalizability and performance. Additionally, redundant features unnecessarily increase the computational time and resources needed to train predictive models. We therefore used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset and determine the extent to which this issue might impact large-scale analyses of these data. We found that 93.457% of biomarkers, 92.549% of gene expression values, and 100% of MRI features were strongly correlated with at least one other feature in ADNI, based on our Bonferroni-corrected α (p-value ≤ 1.40754 × 10⁻¹³). We provide a comprehensive mapping of all ADNI biomarkers to highly correlated features within the dataset. Additionally, we show that significant correlation within the ADNI dataset should be resolved before performing bulk data analyses, and we provide recommendations to address these issues. We anticipate that these recommendations and resources will help researchers using the ADNI dataset to increase model performance and reduce the cost and complexity of their analyses.
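The screening idea can be sketched with NumPy: compute all pairwise Pearson correlations and flag every feature strongly correlated with at least one other. Here a plain |r| cutoff stands in for the paper's Bonferroni-corrected p-value threshold; the function name and toy data are illustrative.

```python
import numpy as np

def strongly_correlated(X, r_cut=0.95):
    """Return a boolean mask over columns of X marking features whose
    correlation with at least one *other* feature reaches |r| >= r_cut,
    plus the full correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(R, 0.0)  # ignore self-correlation
    return (np.abs(R) >= r_cut).any(axis=0), R

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a * 2.0 + rng.normal(scale=0.01, size=200)  # near-duplicate of a
c = rng.normal(size=200)                        # independent feature
mask, R = strongly_correlated(np.c_[a, b, c])
```

Dropping one member of each flagged pair before model training is the kind of redundancy resolution the recommendations call for.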


Geophysics ◽  
2020 ◽  
pp. 1-62
Author(s):  
Shotaro Nakayama ◽  
Gerrit Blacquière

Acquisition of incomplete data, i.e., blended, sparsely sampled, and narrowband data, allows for cost-effective and efficient field seismic operations. This strategy becomes technically acceptable provided that a satisfactory recovery of the complete data, i.e., deblended, well-sampled, and broadband data, is attainable. Hence, we explore a machine-learning approach that simultaneously performs suppression of blending noise, reconstruction of missing traces, and extrapolation of low frequencies. We apply a deep convolutional neural network in a supervised-learning framework, training the network on pairs of incomplete and complete datasets. Incomplete data that were never used for training, and that involve different subsurface properties and acquisition scenarios, are subsequently fed into the trained network to predict complete data. We describe matrix representations indicating the contributions of different acquisition strategies to reducing the field operational effort. We also illustrate that the simultaneous implementation of source blending, sparse geometry, and band limitation leads to significant data compression, in which the size of the incomplete data in the frequency-space domain is much smaller than that of the complete data. This reduction is indicative of the survey cost and duration that our acquisition strategy can save. Both synthetic and field data examples demonstrate the applicability of the proposed approach. Despite the reduced amount of information available in the incomplete data, the results from both cases clearly show that the machine-learning scheme effectively performs deblending, trace reconstruction, and low-frequency extrapolation in a simultaneous fashion; notably, no discernible difference in prediction errors between extrapolated and preexisting frequencies is observed. The approach potentially allows seismic data to be acquired in a significantly compressed manner while subsequently recovering data of satisfactory quality.
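The compression argument can be made concrete with a back-of-envelope calculation: each acquisition choice multiplies the frequency-space data volume by its own factor. The numbers below are illustrative, not the paper's measured figures.

```python
# incomplete-to-complete size ratio in the frequency-space domain
blend_fold = 2    # two sources fire per record -> half as many records
trace_keep = 0.5  # keep every other trace (sparse geometry)
band_frac = 0.6   # record 60% of the target bandwidth

ratio = (1 / blend_fold) * trace_keep * band_frac
print(f"incomplete data is {ratio:.0%} the size of the complete data")
```

Because the factors multiply, even modest blending, decimation, and band limitation combine into a large reduction in acquired data volume, and correspondingly in survey time and cost.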



2020 ◽  
Vol 36 (3) ◽  
Author(s):  
Muhammad Ali Haider ◽  
Uzma Sattar ◽  
Syeda Rushda Zaidi

Purpose: To evaluate the change in visual acuity in relation to the decrease in central macular thickness after a single dose of intravitreal Bevacizumab injection. Study Design: Quasi-experimental study. Place and Duration of Study: Punjab Rangers Teaching Hospital, Lahore, from January 2019 to June 2019. Material and Methods: 70 eyes with diabetic macular edema were included in the study. Patients with high refractive errors (spherical equivalent of > ± 7.5 D) or visual acuity worse than +1.2 or better than +0.2 logMAR were excluded. Central macular thickness was measured in μm on OCT, and visual acuity was documented using a logMAR chart. These values were recorded before and at 1 month after injection with intravitreal Bevacizumab. The Wilcoxon signed-rank test was used to evaluate the difference in visual acuity before and after the anti-VEGF injection; differences in visual acuity and central macular edema were analyzed and reported as p-values, considered statistically significant below 0.01. Results: Mean age of patients was 52.61 ± 1.3 years. Vision improved from 0.90 ± 0.02 to 0.84 ± 0.02 on the logMAR chart; the change was statistically significant with p < 0.001. Central macular thickness was reduced from 328 ± 14 to 283 ± 10.6 μm on OCT after intravitreal anti-VEGF, also with p < 0.001. Conclusion: A 45 μm reduction in central macular thickness was associated with a 0.1 logMAR unit improvement in visual acuity after intravitreal Bevacizumab in diabetic macular edema.


2019 ◽  
Vol 4 (2) ◽  
pp. 402
Author(s):  
Iskim Luthfa ◽  
Nurul Fadhilah

<p><em>People with diabetes mellitus are at risk of developing complications that can affect their quality of life. These complications can be minimized through self-care management. This study aimed to determine the relationship between self-management and quality of life in people with diabetes mellitus. This quantitative correlational study used a cross-sectional design; the sample comprised 118 respondents recruited through non-probability consecutive sampling. Self-management was measured with the Diabetes Self-Management Questionnaire (DSMQ), and quality of life with the WHOQOL-BREF. Spearman rank analysis yielded a p-value of 0.000 with r = 0.394, indicating a significant positive relationship between self-management and quality of life in people with diabetes mellitus.</em></p>


2021 ◽  
pp. 1-5
Author(s):  
David Samuel Kereh ◽  
John Pieter ◽  
William Hamdani ◽  
Haryasena Haryasena ◽  
Daniel Sampepajung ◽  
...  

BACKGROUND: AGR2 expression is associated with luminal breast cancer, and overexpression of AGR2 is a predictor of poor prognosis. Several studies have found correlations between AGR2 and disseminated tumor cells (DTCs) in breast cancer patients. OBJECTIVE: This study aims to determine the correlation between Anterior Gradient 2 (AGR2) expression and the incidence of distant metastases in luminal breast cancer. METHODS: This observational, cross-sectional study was conducted at Wahidin Sudirohusodo Hospital and its network hospitals. AGR2 expression was measured in the blood serum of breast cancer patients by ELISA. AGR2 expression in metastatic and non-metastatic patients was compared with the Mann-Whitney test, and the correlation between AGR2 expression and metastasis was tested with the Spearman rank test. RESULTS: The mean AGR2 antibody expression on ELISA in this study was 2.90 ± 1.82 ng/dl, with a cut-off point of 2.1 ng/dl. Based on this cut-off, 14 subjects (66.7%) had AGR2 overexpression on serum ELISA and 7 subjects (33.3%) did not. Mean AGR2 was significantly higher in metastatic than in non-metastatic patients (3.77 versus 1.76; p < 0.01). The Spearman rank test yielded a two-tailed p-value of 0.003 (p < 0.05), indicating a significant association, and the correlation coefficient of 0.612 indicated a strong positive correlation between AGR2 overexpression and metastasis. CONCLUSIONS: AGR2 expression is correlated with metastasis in luminal breast cancer.
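Several of the abstracts above rely on the Spearman rank test; the statistic itself is just the Pearson correlation of the rank-transformed data, which can be sketched in NumPy. The sample values are made up for illustration.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Note: the double-argsort rank trick breaks ties arbitrarily;
    real analyses use mid-ranks (e.g. scipy.stats.spearmanr)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

x = np.array([2.9, 3.8, 1.7, 4.2, 2.1])  # e.g. serum marker levels (made up)
rho_same = spearman_rho(x, x ** 2)       # same ordering -> rho = 1
rho_flip = spearman_rho(x, -x)           # reversed ordering -> rho = -1
```

Because only the ordering matters, the statistic is robust to monotone transformations of either variable, which is why it suits skewed assay values like ELISA readings.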


Fuels ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 286-303
Author(s):  
Vuong Van Pham ◽  
Ebrahim Fathi ◽  
Fatemeh Belyadi

The success of machine learning (ML) techniques implemented in different industries relies heavily on operator expertise and domain knowledge, which are used to manually choose an algorithm and set its parameters for a given problem. Because model selection and parameter tuning are manual, the quality of this process cannot be quantified or evaluated, which in turn limits the ability to perform comparison studies between different algorithms. In this study, we propose a new hybrid approach for developing machine learning workflows to support automated algorithm selection and hyperparameter optimization. The proposed approach provides a robust, reproducible, and unbiased workflow that can be quantified and validated using different scoring metrics. We used the most common workflows implemented in applications of artificial intelligence (AI) and ML to engineering problems, including grid/random search, Bayesian search and optimization, and genetic programming, and compared them with our new hybrid approach, which integrates the Tree-based Pipeline Optimization Tool (TPOT) with Bayesian optimization. The performance of each workflow is quantified using scoring metrics such as the Pearson correlation (R²) and mean squared error (MSE). For this purpose, actual field data obtained from 1567 gas wells in the Marcellus Shale, with 121 features covering reservoir, drilling, completion, stimulation, and operation, were tested using the proposed workflows. The new hybrid workflow was then used to evaluate the type well used for assessing Marcellus shale gas production. In conclusion, our automated hybrid approach showed significant improvement over the other workflows on both scoring metrics. The new hybrid approach provides a practical tool that supports automated model and hyperparameter selection, tested on real field data, and can be applied to different engineering problems using artificial intelligence and machine learning. When the hybrid model was tested on a real field and compared with conventional type wells developed by field engineers, the field's type well was found to be very close to the P50 predictions, indicating a successful completion design by the field engineers. The analysis also suggests that average field production could have been improved by 8% if shorter cluster spacing and higher proppant loading per cluster had been used during the frac jobs.
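The kind of automated search these workflows perform can be sketched in a few lines: a random search over a ridge-regression penalty, scored by held-out MSE. TPOT and Bayesian optimization search far larger pipeline spaces, but the sample-score-select loop has this shape; all data and names below are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge regression: (X'X + lam*I)^-1 X'y (intercept omitted)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # stand-in for well features
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

# random search: sample penalties on a log scale, keep the best held-out MSE
lams = 10.0 ** rng.uniform(-4, 2, size=20)
best_lam = min(lams, key=lambda l: mse(y_te, X_te @ ridge_fit(X_tr, y_tr, l)))
best_mse = mse(y_te, X_te @ ridge_fit(X_tr, y_tr, best_lam))
```

Bayesian optimization improves on this loop by fitting a surrogate model to past (hyperparameter, score) pairs and sampling where the surrogate predicts improvement, rather than sampling blindly.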


2021 ◽  
Vol 80 (Suppl 1) ◽  
pp. 547.1-547
Author(s):  
C. Deakin ◽  
G. Littlejohn ◽  
H. Griffiths ◽  
T. Smith ◽  
C. Osullivan ◽  
...  

Background: The availability of biosimilars as non-proprietary versions of established biologic disease-modifying anti-rheumatic drugs (bDMARDs) is enabling greater access for patients with rheumatic diseases to effective medications at a lower cost. Since April 2017, both the originator and a biosimilar for etanercept (trade names Enbrel and Brenzys, respectively) have been available for use in Australia. Objectives: (1) To model the effectiveness of the etanercept originator or biosimilar in reducing the Disease Activity Score 28-joint count C-reactive protein (DAS28CRP) in patients with rheumatoid arthritis (RA), psoriatic arthritis (PsA) or ankylosing spondylitis (AS) treated with either drug as first-line bDMARD; (2) to describe persistence on the etanercept originator or biosimilar as first-line bDMARD in patients with RA, PsA or AS. Methods: Clinical data were obtained from the Optimising Patient outcomes in Australian rheumatoLogy (OPAL) dataset, derived from electronic medical records. Eligible patients with RA, PsA or AS who initiated the etanercept originator (n=856) or biosimilar (n=477) as first-line bDMARD between 1 April 2017 and 31 December 2020 were identified. Propensity score matching was performed to select patients on the originator (n=230) or biosimilar (n=136) with similar characteristics in terms of diagnosis, disease duration, joint count, age, sex and concomitant medications. Data on clinical outcomes were recorded at 3 months after baseline and then at 6-monthly intervals; outcomes data missing at a recorded visit were imputed. Effectiveness of the originator, relative to the biosimilar, in reducing DAS28CRP over time was modelled in the matched population using linear mixed models with both random intercepts and slopes to allow for individual heterogeneity, with individuals weighted by inverse probability of treatment weights to ensure comparability between treatment groups. Time was modelled as a combination of linear, quadratic and cubic continuous variables. Persistence on the originator or biosimilar was analysed using survival analysis (log-rank test). Results: Reduction in DAS28CRP was associated with both time and etanercept originator treatment (Table 1). The conditional R-squared for the model was 0.31. The average predicted DAS28CRP values at baseline, 3, 6, 9 and 12 months were 4.0 vs 4.4, 3.1 vs 3.4, 2.6 vs 2.8, 2.3 vs 2.6, and 2.2 vs 2.4 for the originator and biosimilar, respectively, indicating a clinically meaningful effect of time for patients on either drug and an additional modest improvement for patients on the originator. Median time to 50% of patients stopping treatment was 25.5 months for the originator and 24.1 months for the biosimilar (p=0.53). An adverse event was the reason for discontinuing treatment in 33 patients (14.5%) on the originator and 18 patients (12.9%) on the biosimilar. Conclusion: Analysis of a large national real-world dataset showed that treatment with either the etanercept originator or the biosimilar was associated with a reduction in DAS28CRP over time, with the originator associated with a further modest reduction that was not clinically significant. Persistence on treatment did not differ between the two drugs.
Table 1. Model fixed effects.
Fixed Effect | Estimate | 95% Confidence Interval | p-value
Time (linear) | 0.90 | 0.89, 0.91 | 1.5e-63
Time (quadratic) | 1.01 | 1.00, 1.01 | 1.3e-33
Time (cubic) | 1.00 | 1.00, 1.00 | 7.1e-23
Originator | 0.91 | 0.86, 0.96 | 0.0013
Acknowledgements: The authors acknowledge the members of OPAL Rheumatology Ltd and their patients for providing clinical data for this study, and Software4Specialists Pty Ltd for providing the Audit4 platform. Supported in part by a research grant from the Investigator-Initiated Studies Program of Merck & Co Inc, Kenilworth, NJ, USA. The opinions expressed in this paper are those of the authors and do not necessarily represent those of Merck & Co Inc, Kenilworth, NJ, USA. Disclosure of Interests: Claire Deakin: None declared. Geoff Littlejohn: Over the last 5 years has received educational grants and consulting fees from AbbVie, Bristol Myers Squibb, Eli Lilly, Gilead, Novartis, Pfizer, Janssen, Sandoz, Sanofi and Seqirus. Hedley Griffiths: Consultant of AbbVie, Gilead, Novartis and Lilly. Tegan Smith: None declared. Catherine OSullivan: None declared. Paul Bird: Speakers bureau for Eli Lilly, AbbVie, Pfizer, BMS, UCB, Gilead, Novartis.

