Detection and Identification of Organic Pollutants in Drinking Water from Fluorescence Spectra Based on Deep Learning Using Convolutional Autoencoder

Water ◽  
2021 ◽  
Vol 13 (19) ◽  
pp. 2633
Author(s):  
Jie Yu ◽  
Yitong Cao ◽  
Fei Shi ◽  
Jiegen Shi ◽  
Dibo Hou ◽  
...  

Three-dimensional fluorescence spectroscopy has become increasingly useful in the detection of organic pollutants. However, its accuracy decreases when identifying low-concentration pollutants. This research therefore proposes a new identification method for organic pollutants in drinking water that uses three-dimensional fluorescence spectroscopy data and a deep learning algorithm. A convolutional autoencoder was designed, as a novel application, to process high-dimensional fluorescence data and extract multi-scale features from the spectra of drinking water samples containing organic pollutants. Extreme Gradient Boosting (XGBoost), an implementation of gradient-boosted decision trees, was used to identify the organic pollutants based on the obtained features. Identification performance was validated on three typical organic pollutants at different concentrations for the scenario of accidental pollution. Results showed that the proposed method achieved increased accuracy for both high-concentration (>10 μg/L) and low-concentration (≤10 μg/L) pollutant samples. Compared to traditional spectrum processing techniques, the convolutional autoencoder-based approach extracted more detailed features from fluorescence spectral data. Moreover, evidence indicated that the proposed method maintained its detection ability when the background water changed, so it can effectively reduce the rate of misjudgments associated with fluctuations in drinking water quality. This study demonstrates the possibility of using deep learning algorithms for spectral processing and contamination detection in drinking water.

Author(s):  
Ruopeng Xie ◽  
Jiahui Li ◽  
Jiawei Wang ◽  
Wei Dai ◽  
André Leier ◽  
...  

Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is increasingly difficult to identify novel VFs using existing tools that were developed on outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools focused on extracting only a few types of features. By addressing these issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome-wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that uses the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms (random forest, support vector machines, extreme gradient boosting, and multilayer perceptron) and three DL algorithms (convolutional neural networks, long short-term memory networks, and deep neural networks) are employed to train 62 baseline models using these features. To integrate their individual strengths, DeepVF combines these baseline models to construct the final meta model using the stacking strategy. Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves more accurate and stable performance than the baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user’s viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.


2021 ◽  
Author(s):  
Ayumi Koyama ◽  
Dai Miyazaki ◽  
Yuji Nakagawa ◽  
Yuji Ayatsuka ◽  
Hitomi Miyake ◽  
...  

Corneal opacities are an important cause of blindness, and their major etiology is infectious keratitis. Slit-lamp examinations are commonly used to determine the causative pathogen; however, their diagnostic accuracy is low even for experienced ophthalmologists. To characterize the “face” of an infected cornea, we have adapted a deep learning architecture used for facial recognition and applied it to determine a probability score for a specific pathogen causing keratitis. To record the diverse features and mitigate the uncertainty, batches of probability scores of four serial images taken from many angles or with fluorescence staining were learned for score- and decision-level fusion using a gradient boosting decision tree. A total of 4306 slit-lamp images and 312 images obtained from internet publications on keratitis caused by bacteria, fungi, acanthamoeba, and herpes simplex virus (HSV) were studied. The created algorithm had a high overall diagnostic accuracy: by group K-fold validation, the accuracy/area under the curve (AUC) was 97.9%/0.995 for acanthamoeba, 90.7%/0.963 for bacteria, 95.0%/0.975 for fungi, and 92.3%/0.946 for HSV, and the algorithm was robust even to low-resolution web images. We suggest that our hybrid deep learning-based algorithm be used as a simple and accurate method for computer-assisted diagnosis of infectious keratitis.
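The group K-fold validation mentioned above can be illustrated with a minimal sketch. The abstract does not say what defines a group; the example assumes, purely for illustration, that each group is one patient, so that images from the same patient never appear in both the training and the test split. The function name `group_kfold` and the toy IDs are hypothetical.

```python
def group_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs so that no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(n_splits):
        # Assign every n_splits-th group to the test side of this fold.
        test_groups = set(unique[fold::n_splits])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


# Hypothetical grouping: one ID per patient, several images per patient.
patient_of_image = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
folds = list(group_kfold(patient_of_image, n_splits=2))

for train_idx, test_idx in folds:
    train_patients = {patient_of_image[i] for i in train_idx}
    test_patients = {patient_of_image[i] for i in test_idx}
    print(sorted(train_patients), sorted(test_patients))
```

Splitting by group rather than by image avoids an optimistic bias that arises when near-identical images of the same eye land on both sides of a split.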


West Nile virus (WNV) is a mosquito-borne virus that infects humans through mosquito bites. The disease is considered a serious threat to society, especially in the United States, where it is frequently found in localities with water bodies. The traditional approach is to collect mosquitoes in traps in a locality and test whether they carry the virus. If the virus is found, the locality is sprayed with pesticides. This process is very time consuming and requires substantial financial support. Machine learning methods can provide an efficient way to predict the presence of the virus in a locality using location and weather data. This paper uses a dataset available on Kaggle that includes information about the traps in each locality and about the locality’s weather. The dataset is imbalanced, so the Synthetic Minority Over-sampling Technique (SMOTE), an upsampling method, is used to balance it. Ensemble learning classifiers, namely random forest, gradient boosting, and Extreme Gradient Boosting (XGB), are then applied. Their performance is compared with that of SVM, taken as the best conventional supervised learning algorithm. Among the models, XGB gave the highest F1 score of 92.93, performing marginally better than random forest (92.78) and SVM (91.16).
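The core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is a random point on the segment between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration, not the paper's pipeline; the function name `smote`, the toy feature values, and the choice of `k` are assumptions.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy data (hypothetical): 6 virus-negative traps, 3 virus-positive traps.
majority = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4), (0.4, 0.1)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]

new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.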


2020 ◽  
Vol 7 (4) ◽  
pp. 807
Author(s):  
Siti Mutrofin ◽  
M. Mughniy Machfud ◽  
Diema Hernyka Satyareni ◽  
Raden Venantius Hari Ginardi ◽  
Chastine Fatichah

Major selection at SMA Negeri 1 Jogoroto, Jombang, East Java follows the 2013 curriculum, under which a student's major is determined not only by the student's own preference and the specialization test taken in the first week of senior high school (SMA), but also by the student's junior high school (SMP) record (report card grades, National Examination (UN) scores, and the guidance counselor's recommendation) and the parents' recommendation. Until now, the school has used a conventional process based on Microsoft Excel, which is slow and prone to calculation errors. Majors are assigned at the start of each academic year for new grade X students; on average, the school processes about 290 students per year with limited time and human resources. In this study, the ID3 algorithm was unsuitable because the data are numeric, whereas ID3 can only handle nominal or polynomial data, so the C4.5 algorithm was used instead. However, several studies report that C4.5 performs worse than Gradient Boosting Trees, Random Forests, and Deep Learning. The four methods were therefore compared to assess their effectiveness in determining majors at SMA. The data are the new-student admission records for the 2018/2019 academic year. The results show that, with polynomial attributes, Deep Learning performed best among all algorithms when using the ExpRectifier activation function. With numeric attributes, Deep Learning performed best among all algorithms when using the Tanh function, across all random samplings. However, Deep Learning performed worst among all algorithms when using the absolute loss function.


Information ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 486
Author(s):  
Xiaoyan Zhang ◽  
Qiang Yan ◽  
Simin Zhou ◽  
Linye Ma ◽  
Siran Wang

The number of consumers playing virtual reality games is booming. To speed up product iteration, the user experience team needs to collect and analyze unsatisfying experiences in time. In this paper, we aim to detect the unsatisfying experiences hidden in online reviews of virtual reality exergames using a deep learning method and to identify the unmet psychological needs of users based on self-determination theory. Convolutional neural networks for sentence classification (textCNN) are used to classify online reviews with unsatisfying experiences. For comparison, eXtreme gradient boosting (XGBoost) with lexical features serves as the machine learning baseline. Term frequency-inverse document frequency (TF-IDF) is used to extract keywords from every set of classified reviews. The micro-F1 score of the textCNN classifier is 90.00, compared with 82.69 for XGBoost. The top 10 keywords of every set of reviews reflect relevant topics of unmet psychological needs. This paper explores, through text mining, the potential problems causing unsatisfying experiences and unmet psychological needs in virtual reality exergames, and it complements experimental studies of virtual reality exergames.
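The keyword-extraction step can be sketched as follows. The abstract does not specify the TF-IDF variant or the tokenization, so the example assumes the plainest form (relative term frequency times log inverse document frequency, whitespace tokens) and treats each set of classified reviews as one document. The function name `top_keywords` and the review snippets are hypothetical.

```python
import math
from collections import Counter


def top_keywords(groups, n=3):
    """Rank the words of each review group by TF-IDF, one document per group."""
    tokenised = {name: text.lower().split() for name, text in groups.items()}
    # Document frequency: in how many groups does each word occur?
    df = Counter(word for words in tokenised.values() for word in set(words))
    n_docs = len(tokenised)
    result = {}
    for name, words in tokenised.items():
        tf = Counter(words)
        scores = {
            w: (count / len(words)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        }
        # Sort by score (descending), then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:n]
    return result


# Hypothetical review snippets, grouped by predicted class.
groups = {
    "unsatisfying": "tracking lost tracking broken game crashed",
    "satisfying": "fun workout game great fun",
}
keywords = top_keywords(groups, n=2)
print(keywords)
```

Words shared by every group (here "game") get an inverse document frequency of zero, which is why TF-IDF surfaces terms specific to one class of reviews.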


Author(s):  
He Yang ◽  
Emma Li ◽  
Yi Fang Cai ◽  
Jiapei Li ◽  
George X. Yuan

This paper establishes a framework for extracting early-warning risk features for predicting financial distress, based on the XGBoost model and SHAP. How early-warning risk features are constructed is important for predicting corporate financial distress. Compared with traditional statistical methods, data-driven machine learning achieves better prediction accuracy in financial early warning, but the resulting models can be difficult to explain. Recently, eXtreme Gradient Boosting (XGBoost), an ensemble learning algorithm based on gradient boosting, has become a hot topic in machine learning research owing to its strong ability to capture nonlinear information and its high prediction accuracy in practice. In this study, the XGBoost algorithm is used to extract early-warning features for predicting the financial distress of listed companies; 76 financial risk features from seven categories and 14 non-financial risk features from four categories are collected to establish an early-warning system for the prediction of financial distress. In empirical tests with respect to AUC, KS, and Kappa, the numerical results show that the XGBoost-based method predicts the financial distress risk of listed companies much better than the Logistic model. Moreover, under the SHAP (SHapley Additive exPlanations) framework, we are able to give a reasonable, visual explanation of the important risk features and of how they affect financial distress. The results show that the XGBoost approach to modelling early-warning features for financial distress not only achieves better prediction accuracy but is also explainable, which is significant in practice for identifying early warnings of financial distress risk for listed companies.
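The three evaluation measures named above (AUC, KS, and Kappa) can be computed directly from labels and scores. The sketch below uses their standard definitions on toy data; the function names, the scores, and the 0.5 decision threshold are illustrative assumptions, not values from the paper.

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """Largest gap between true-positive rate and false-positive rate over all thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in set(scores):
        tpr = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t) / n_neg
        best = max(best, tpr - fpr)
    return best


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(1 for a, b in zip(y_true, y_pred) if a == b) / n
    p_true, p_pred = sum(y_true) / n, sum(y_pred) / n
    expected = p_true * p_pred + (1 - p_true) * (1 - p_pred)
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 1 = distressed company, 0 = healthy company.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

print(auc(y_true, scores), ks_statistic(y_true, scores), cohen_kappa(y_true, y_pred))
```

AUC and KS are threshold-free summaries of how well the scores rank the two classes, whereas Kappa depends on the chosen cut-off.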


2021 ◽  
Vol 10 (9) ◽  
pp. 1875
Author(s):  
I-Min Chiu ◽  
Chi-Yung Cheng ◽  
Wun-Huei Zeng ◽  
Ying-Hsien Huang ◽  
Chun-Hung Richard Lin

Background: The aim of this study was to develop and evaluate a machine learning (ML) model to predict invasive bacterial infections (IBIs) in young febrile infants visiting the emergency department (ED). Methods: This retrospective study was conducted in the EDs of three medical centers across Taiwan from 2011 to 2018. We included patients aged 0–60 days who visited the ED with clinical symptoms of fever. We developed three different ML algorithms, namely logistic regression (LR), support vector machine (SVM), and extreme gradient boosting (XGBoost), and compared their performance at predicting IBIs with that of a previously validated scoring system (IBI score). Results: During the study period, 4211 patients were included, of whom 126 (3.1%) had IBI. A total of eight, five, and seven features were used in the LR, SVM, and XGBoost models, respectively, after the feature selection process. The ML models achieved a better AUROC value when predicting IBIs in young infants compared with the IBI score (LR: 0.85 vs. SVM: 0.84 vs. XGBoost: 0.85 vs. IBI score: 0.70, p-value < 0.001). Using a cost-sensitive learning algorithm, all ML models showed better specificity in predicting IBIs at a 90% sensitivity level compared to an IBI score > 2 (LR: 0.59 vs. SVM: 0.60 vs. XGBoost: 0.57 vs. IBI score > 2: 0.43, p-value < 0.001). Conclusions: All ML models developed in this study outperformed the traditional scoring system in stratifying low-risk febrile infants after standardizing the sensitivity level.
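Comparing models "at a 90% sensitivity level" means fixing the decision threshold so that at least 90% of true cases are flagged, then reading off the specificity. The sketch below shows only that threshold-selection step on toy scores; it does not reproduce the cost-sensitive training described in the abstract, and the function name and data are hypothetical.

```python
def specificity_at_sensitivity(y_true, scores, target_sensitivity=0.90):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity reaches the target. A case is predicted positive when
    its score is >= threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        sensitivity = tp / positives
        if sensitivity >= target_sensitivity:
            return threshold, sensitivity, 1 - fp / negatives
    raise ValueError("target sensitivity cannot be reached")


# Toy data (hypothetical): 10 infants with IBI (1) and 10 without (0).
y_true = [1] * 10 + [0] * 10
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
          0.58, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05]

threshold, sensitivity, specificity = specificity_at_sensitivity(y_true, scores)
print(threshold, sensitivity, specificity)
```

Fixing sensitivity first reflects the clinical priority: missing an invasive infection is costlier than an unnecessary work-up, so models are compared on how many low-risk infants they can still rule out.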


Atmosphere ◽  
2019 ◽  
Vol 10 (7) ◽  
pp. 373 ◽  
Author(s):  
Mehdi Zamani Joharestani ◽  
Chunxiang Cao ◽  
Xiliang Ni ◽  
Barjeece Bashir ◽  
Somayeh Talebiesfandarani

In recent years, air pollution has become an important public health concern. High concentrations of fine particulate matter with a diameter of less than 2.5 µm (PM2.5) are known to be associated with lung cancer, cardiovascular disease, respiratory disease, and metabolic disease. Predicting PM2.5 concentrations can help governments warn people at high risk, thus mitigating the complications. Although attempts have been made to predict PM2.5 concentrations, the factors influencing PM2.5 prediction have not been investigated. In this work, we study feature importance for PM2.5 prediction in Tehran’s urban area using three machine learning (ML) approaches: random forest, extreme gradient boosting (XGBoost), and deep learning. We use 23 features, including satellite and meteorological data, ground-measured PM2.5, and geographical data, in the modeling. The best model performance obtained was R2 = 0.81 (R = 0.9), MAE = 9.93 µg/m3, and RMSE = 13.58 µg/m3, using the XGBoost approach with unimportant features eliminated. However, all three ML methods performed similarly: R2 varied from 0.63 to 0.67 when Aerosol Optical Depth (AOD) at 3 km resolution was included, and from 0.77 to 0.81 when AOD at 3 km resolution was excluded. In contrast to the PM2.5 lag data, satellite-derived AODs did not improve model performance.
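The three reported metrics can be computed as in the minimal sketch below, which uses their standard definitions on made-up values (the numbers are illustrative, not the study's data). The last line checks that the reported R = 0.9 is consistent with R being the square root of R2 = 0.81.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, and RMSE for paired observations and predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Hypothetical PM2.5 values in µg/m3.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, predicted)
print(metrics)

# Consistency check for the abstract's figures: sqrt(0.81) is 0.9.
r_from_r2 = math.sqrt(0.81)
print(round(r_from_r2, 2))
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which matches the reported pair (13.58 versus 9.93 µg/m3).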

