A Review of Biohydrogen Productions from Lignocellulosic Precursor via Dark Fermentation: Perspective on Hydrolysate Composition and Electron-Equivalent Balance

Yiyang Liu; Jingluo Min; Xingyu Feng; Yue He; Jinze Liu; Yixiao Wang; Jun He; Hainam Do; Valérie Sage; Gang Yang; Yong Sun

doi:10.3390/en13102451

A Review of Biohydrogen Productions from Lignocellulosic Precursor via Dark Fermentation: Perspective on Hydrolysate Composition and Electron-Equivalent Balance

Energies ◽

10.3390/en13102451 ◽

2020 ◽

Vol 13 (10) ◽

pp. 2451 ◽

Cited By ~ 1

Author(s):

Yiyang Liu ◽

Jingluo Min ◽

Xingyu Feng ◽

Yue He ◽

Jinze Liu ◽

...

Keyword(s):

Technological Development ◽

Carbon Sources ◽

Dark Fermentation ◽

Continuous Operation ◽

Training Data ◽

Supervised Machine Learning ◽

Hydrogen Yield ◽

Data Set ◽

Microbial Strains ◽

Electron Equivalent

This paper reviews the current technological development of bio-hydrogen (BioH2) generation, focusing on using lignocellulosic feedstock via dark fermentation (DF). Using the collected reference reports as the training data set, supervised machine learning via the constructed artificial neuron networks (ANNs) imbedded with feed backward propagation and one cross-out validation approach was deployed to establish correlations between the carbon sources (glucose and xylose) together with the inhibitors (acetate and other inhibitors, such as furfural and aromatic compounds), hydrogen yield (HY), and hydrogen evolution rate (HER) from reported works. Through the statistical analysis, the concentrations variations of glucose (F-value = 0.0027) and acetate (F-value = 0.0028) were found to be statistically significant among the investigated parameters to HY and HER. Manipulating the ratio of glucose to acetate at an optimal range (approximate in 14:1) will effectively improve the BioH2 generation (HY and HER) regardless of microbial strains inoculated. Comparative studies were also carried out on the evolutions of electron equivalent balances using lignocellulosic biomass as substrates for BioH2 production across different reported works. The larger electron sinks in the acetate is found to be appreciably related to the higher HY and HER. To maintain a relative higher level of the BioH2 production, the biosynthesis needs to be kept over 30% in batch cultivation, while the biosynthesis can be kept at a low level (2%) in the continuous operation among the investigated reports. Among available solutions for the enhancement of BioH2 production, the selection of microbial strains with higher capacity in hydrogen productions is still one of the most phenomenal approaches in enhancing BioH2 production. Other process intensifications using continuous operation compounded with synergistic chemical additions could deliver additional enhancement for BioH2 productions during dark fermentation.

Download Full-text

Lost in Space: Geolocation in Event Data

Political Science Research and Methods ◽

10.1017/psrm.2018.23 ◽

2018 ◽

Vol 7 (04) ◽

pp. 871-888 ◽

Cited By ~ 6

Author(s):

Sophie J. Lee ◽

Howard Liu ◽

Michael D. Ward

Keyword(s):

Learning Algorithm ◽

Text Processing ◽

Contextual Information ◽

Training Data ◽

Supervised Machine Learning ◽

Model Parameters ◽

Event Data ◽

Data Set ◽

N Gram ◽

Automated Text Processing

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention to be either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.

Download Full-text

Twitter Sentiment Recognition using Support Vector Machine

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d9414.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 8797-8801

Keyword(s):

Learning Algorithm ◽

Training Data ◽

Supervised Machine Learning ◽

Support Vector ◽

Data Set ◽

Political Views ◽

Language Knowledge ◽

Processing Techniques ◽

And Training

In this we explore the effectiveness of language features to identify Twitter messages ' feelings. We assess the utility of existing lexical tools as well as capturing features of informal and innovative language knowledge used in micro blogging. We take a supervised approach to the problem, but to create training data, we use existing hash tags in the Twitter data. We Using three separate Twitter messaging companies in our experiments. We use the hash tagged data set (HASH) for development and training, which we compile from the Edinburgh Twitter corpus, and the emoticon data set (EMOT) from the I Sieve Corporation (ISIEVE) for evaluation. Twitter contains huge amount of data . This data may be of different types such as structured data or unstructured data. So by using this data and Appling pre processing techniques we can be able to read the comments from the users. And also the comments will be classified into three categories. They are positive negative and also the neutral comments.Today they use the processing of natural language, information, and text interpretation to derive and classify text feeling into pos itive, negative, and neutral categories. We can also examine the utility of language features to identify Twitter mess ages ' feelings. In addition, state-of - the-art approaches take into consideration only the tweet to be classified when classifying the feeling; they ignore its context (i.e. related tweets).Since tweets are usually short and more ambiguous, however, it is sometimes not enough to consider only the current tweet for classification of sentiments.Informal and innovative microblogging language. We take a sup ervised approach to the problem, but to create training data, we use existing hashtags in the Twitter data.This paper also contrasts sentiment analysis approaches in evaluating political views using Naïve Bayes supervised machine learning algorithm which performs in better analysis compared to other techniques Paper

Download Full-text

The impact of using large training data set KDD99 on classification accuracy

10.7287/peerj.preprints.2838v1 ◽

2017 ◽

Author(s):

Atilla Özgür ◽

Hamit Erdem

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Training Data ◽

Supervised Machine Learning ◽

Support Vector ◽

Data Set ◽

Test Dataset ◽

Negative Rate ◽

Positive Rate ◽

The Impact

This study investigates the effects of using a large data set on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect 12 machine learning algorithms have been applied. These algorithms are: (1) Adaboost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5)Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9)Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark dataset, KDD99 has been used to train and test classifiers. Full training data set of KDD99 is 4.9 million instances while full test dataset is 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% for training and 1.2%–100% for testing, this study uses full training dataset and full test dataset. Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1-Rate. To show effects of dataset size, performance of classifiers has been also evaluated using following hardware metrics: Training Time, Working Memory and Model Size. Test results shows improvements in classifiers in standard performance metrics compared to previous studies.

Download Full-text

Probabilistic cosmic web classification using fast-generated training data

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa2008 ◽

2020 ◽

Vol 497 (4) ◽

pp. 5041-5060

Author(s):

Brandon Buncher ◽

Matias Carrasco Kind

Keyword(s):

Learning Algorithm ◽

Local Density ◽

Density Field ◽

Training Data ◽

Three Dimensions ◽

Supervised Machine Learning ◽

Data Generation ◽

Data Set ◽

Field Magnitude ◽

Web Classification

ABSTRACT We present a novel method of robust probabilistic cosmic web particle classification in three dimensions using a supervised machine learning algorithm. Training data were generated using a simplified ΛCDM toy model with pre-determined algorithms for generating haloes, filaments, and voids. While this framework is not constrained by physical modelling, it can be generated substantially more quickly than an N-body simulation without loss in classification accuracy. For each particle in this data set, measurements were taken of the local density field magnitude and directionality. These measurements were used to train a random forest algorithm, which was used to assign class probabilities to each particle in a ΛCDM, dark matter-only N-body simulation with 2563 particles, as well as on another toy model data set. By comparing the trends in the ROC curves and other statistical metrics of the classes assigned to particles in each data set using different feature sets, we demonstrate that the combination of measurements of the local density field magnitude and directionality enables accurate and consistent classification of halo, filament, and void particles in varied environments. We also show that this combination of training features ensures that the construction of our toy model does not affect classification. The use of a fully supervised algorithm allows greater control over the information deemed important for classification, preventing issues arising from arbitrary hyperparameters and mode collapse in deep learning models. Due to the speed of training data generation, our method is highly scalable, making it particularly suited for classifying large data sets, including observed data.

Download Full-text

Boosted Supervised Intensional Learning Supported by Unsupervised Learning

International Journal of Machine Learning and Computing ◽

10.18178/ijmlc.2021.11.2.1020 ◽

2021 ◽

Vol 11 (2) ◽

pp. 98-102

Author(s):

A. C. M. Fong ◽

◽

G. Hong

Keyword(s):

Unsupervised Learning ◽

Training Data ◽

Supervised Machine Learning ◽

Data Sets ◽

Preliminary Evaluation ◽

Data Set ◽

Domain Experts ◽

Large Sets ◽

Accuracy Result ◽

Similar Accuracy

Traditionally, supervised machine learning (ML) algorithms rely heavily on large sets of annotated data. This is especially true for deep learning (DL) neural networks, which need huge annotated data sets for good performance. However, large volumes of annotated data are not always readily available. In addition, some of the best performing ML and DL algorithms lack explainability – it is often difficult even for domain experts to interpret the results. This is an important consideration especially in safety-critical applications, such as AI-assisted medical endeavors, in which a DL’s failure mode is not well understood. This lack of explainability also increases the risk of malicious attacks by adversarial actors because these actions can become obscured in the decision-making process that lacks transparency. This paper describes an intensional learning approach which uses boosting to enhance prediction performance while minimizing reliance on availability of annotated data. The intensional information is derived from an unsupervised learning preprocessing step involving clustering. Preliminary evaluation on the MNIST data set has shown encouraging results. Specifically, using the proposed approach, it is now possible to achieve similar accuracy result as extensional learning alone while using only a small fraction of the original training data set.

Download Full-text

Use of artificial intelligence to estimate population health indicators in France

European Journal of Public Health ◽

10.1093/eurpub/ckaa165.267 ◽

2020 ◽

Vol 30 (Supplement_5) ◽

Author(s):

R Haneef ◽

S Fuentes ◽

R Hrzic ◽

S Fosse-Edorh ◽

S Kab ◽

...

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Training Data ◽

Supervised Machine Learning ◽

Data Sets ◽

Data Set ◽

Selection Of

Abstract Background The use of artificial intelligence is increasing to estimate and predict health outcomes from large data sets. The main objectives were to develop two algorithms using machine learning techniques to identify new cases of diabetes (case study I) and to classify type 1 and type 2 (case study II) in France. Methods We selected the training data set from a cohort study linked with French national Health database (i.e., SNDS). Two final datasets were used to achieve each objective. A supervised machine learning method including eight following steps was developed: the selection of the data set, case definition, coding and standardization of variables, split data into training and test data sets, variable selection, training, validation and selection of the model. We planned to apply the trained models on the SNDS to estimate the incidence of diabetes and the prevalence of type 1/2 diabetes. Results For the case study I, 23/3468 and for case study II, 14/3481 SNDS variables were selected based on an optimal balance between variance explained and using the ReliefExp algorithm. We trained four models using different classification algorithms on the training data set. The Linear Discriminant Analysis model performed best in both case studies. The models were assessed on the test datasets and achieved a specificity of 67% and a sensitivity of 62% in case study I, and a specificity of 97 % and sensitivity of 100% in case study II. The case study II model was applied to the SNDS and estimated the prevalence of type 1 diabetes in 2016 in France of 0.3% and for type 2, 4.4%. The case study model I was not applied to the SNDS. Conclusions The case study II model to estimate the prevalence of type 1/2 diabetes has good performance and will be used in routine surveillance. The case study I model to identify new cases of diabetes showed a poor performance due to missing necessary information on determinants of diabetes and will need to be improved for further research.

Download Full-text

The impact of using large training data set KDD99 on classification accuracy

10.7287/peerj.preprints.2838 ◽

2017 ◽

Author(s):

Atilla Özgür ◽

Hamit Erdem

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Training Data ◽

Supervised Machine Learning ◽

Support Vector ◽

Data Set ◽

Test Dataset ◽

Negative Rate ◽

Positive Rate ◽

The Impact

Download Full-text

Exploiting Class Learnability in Noisy Data

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014082 ◽

2019 ◽

Vol 33 ◽

pp. 4082-4089

Author(s):

Matthew Klawonn ◽

Eric Heim ◽

James Hendler

Keyword(s):

Training Data ◽

Supervised Machine Learning ◽

Data Sets ◽

Generalization Error ◽

Data Set ◽

Class Testing ◽

Model Generalization ◽

Improve Model ◽

Real World Applications ◽

And Training

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.

Download Full-text

Predicting Tactical Solutions to Operational Planning Problems Under Imperfect Information

INFORMS Journal on Computing ◽

10.1287/ijoc.2021.1091 ◽

2021 ◽

Author(s):

Eric Larsen ◽

Sébastien Lachapelle ◽

Yoshua Bengio ◽

Emma Frejinger ◽

Simon Lacoste-Julien ◽

...

Keyword(s):

Machine Learning ◽

High Speed ◽

Imperfect Information ◽

Predictive Accuracy ◽

Computing Time ◽

Training Data ◽

Supervised Machine Learning ◽

Two Stage ◽

Data Set ◽

Second Stage

This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.

Download Full-text

Cloning Safe Driving Behavior for Self-Driving Cars using Convolutional Neural Networks

Recent Patents on Computer Science ◽

10.2174/2213275911666181106160002 ◽

2019 ◽

Vol 12 (2) ◽

pp. 120-127 ◽

Cited By ~ 5

Author(s):

Wael Farag

Keyword(s):

Gradient Descent ◽

Autonomous Driving ◽

Driving Behavior ◽

Training Data ◽

Stochastic Gradient Descent ◽

Data Set ◽

Safe Driving ◽

Processing Pipeline ◽

Self Driving Cars ◽

And Training

Background: In this paper, a Convolutional Neural Network (CNN) to learn safe driving behavior and smooth steering manoeuvring, is proposed as an empowerment of autonomous driving technologies. The training data is collected from a front-facing camera and the steering commands issued by an experienced driver driving in traffic as well as urban roads. Methods: This data is then used to train the proposed CNN to facilitate what it is called “Behavioral Cloning”. The proposed Behavior Cloning CNN is named as “BCNet”, and its deep seventeen-layer architecture has been selected after extensive trials. The BCNet got trained using Adam’s optimization algorithm as a variant of the Stochastic Gradient Descent (SGD) technique. Results: The paper goes through the development and training process in details and shows the image processing pipeline harnessed in the development. Conclusion: The proposed approach proved successful in cloning the driving behavior embedded in the training data set after extensive simulations.

Download Full-text