Convolutional neural net learns promoter sequence features driving transcription strength

10.29007/8fmw ◽  
2020 ◽  
Author(s):  
Nicholas Leiby ◽  
Ayaan Hossain ◽  
Howard M Salis

Promoters drive gene expression and help regulate cellular responses to the environment. In recent research, machine learning models have been developed to predict a bacterial promoter’s transcriptional initiation rate, although these models utilize expert-labeled sequence elements across a defined set of DNA building blocks. The generalizability of these methods is therefore limited by the necessary labeling of the specific components studied. As a result, current models have not been used to predict the transcriptional initiation rates of promoters with generalized nucleotide sequences. If generalizable models existed, they could greatly facilitate the design of synthetic genetic circuits with well-controlled transcription rates in bacteria. To address these limitations, we used a convolutional neural network (CNN) to predict a promoter’s transcriptional initiation rate directly from its DNA nucleotide sequence. We first evaluated the model on a published promoter component dataset. Trained using only the sequence as input, our model fits held-out test data with R² = 0.90, comparable to published models that fit expert-labeled sequence elements. We produced a new promoter strength dataset including non-repetitive promoters with high sequence variation and not limited to combinations of discrete expert-labeled components. Our CNN trained on this more varied dataset fits held-out promoter strength with R² = 0.61. Previously published models are intractable on a dataset like this with highly diverse inputs. The CNN outperforms classical baselines such as LASSO on a bag of words for promoter sequence elements (R² = 0.42). We applied recent machine learning approaches to quantify the contribution of individual nucleotides to the CNN's promoter strength prediction.
Learning directly from DNA sequence, our model identified the consensus -35 and -10 hexamer regions as well as the discriminator element as key contributors to σ70 promoter strength. It also replicated a finding that a perfect consensus sequence match does not yield the strongest promoter. The model's ability to independently learn biologically relevant information directly from sequence, while performing similarly to or better than classical methods, makes it appealing for further prediction optimization and research into generalizability. This approach may be useful for synthetic promoter design, as well as for sequence feature identification.
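
As a rough illustration of what "learning directly from sequence" involves, the sketch below one-hot encodes a DNA string and slides a single convolution kernel along it. It is not the authors' model: the kernel is hand-set to the well-known σ70 -10 consensus hexamer TATAAT (a trained CNN would learn its weights from data), the promoter fragment is made up, and the function names `one_hot` and `conv1d_scan` are hypothetical.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence (valid padding, stride 1)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores

# Assumption: a hand-set kernel that rewards exact matches to TATAAT.
# In a trained CNN these weights would be learned, not specified by hand.
minus10_kernel = one_hot("TATAAT")

# Made-up promoter fragment for illustration (not from the paper's dataset).
promoter = "TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC"
scores = conv1d_scan(one_hot(promoter), minus10_kernel)

# Position where the kernel responds most strongly.
best_position = max(range(len(scores)), key=scores.__getitem__)
print(best_position, promoter[best_position:best_position + 6])
```

Each score counts how many of the six positions in the window match the kernel's preferred base, so the peak marks the window that best resembles the hexamer. A real network stacks many such kernels with nonlinearities and fits them to measured promoter strengths.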

2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Isonkobong Christopher Udousoro

Because data are complex, interpreting patterns or extracting information becomes difficult; machine learning is therefore used to teach machines how to handle data more efficiently. With the growth of datasets, various organizations now apply machine learning applications and algorithms. Many industries apply machine learning to extract relevant information for analysis purposes. Many scholars, mathematicians and programmers have carried out research and applied several machine learning approaches in order to find solutions to problems. In this paper, we present a general review of machine learning, including various machine learning techniques. These techniques can be applied to different fields such as image processing, data mining and predictive analysis. The paper aims at reviewing machine learning techniques and algorithms. The research methodology is based on qualitative analysis, in which the literature on machine learning is reviewed.


2021 ◽  
Vol 12 ◽  
Author(s):  
Ching-Hsuan Chien ◽  
Lan-Ying Huang ◽  
Shuen-Fang Lo ◽  
Liang-Jwu Chen ◽  
Chi-Chou Liao ◽  
...  

Inserting T-DNA into the genome to change the expression of flanking genes is a common technique in rice functional gene research. However, whether the expression of a gene of interest is enhanced must be validated experimentally. Consequently, to improve the efficiency of screening activated genes, we established a model to predict gene expression in T-DNA mutants through machine learning methods. We gathered experimental datasets consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In first-layer models, support vector machine (SVM) models were constructed with nine features consisting of information about biological function and local and global sequences. Feature encoding based on the PROMOTER sequence was weighted by logistic regression. The second-layer models integrated 16 first-layer models with minimum redundancy maximum relevance (mRMR) feature selection and the LADTree algorithm, which were selected from nine feature selection methods and 65 classification methods, respectively. The accuracy of the final two-layer machine learning model, referred to as TIMgo, was 99.3% based on fivefold cross-validation, and 85.6% based on independent testing. We discovered that the information within the local sequence had a greater contribution than the global sequence with respect to classification. TIMgo had a good predictive ability for target genes within 20 kb from the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the activation mechanism of the 35S enhancer.
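
The abstract reports accuracy under fivefold cross-validation and on an independent test set. The sketch below shows how fivefold splitting partitions samples so that each one is held out exactly once. The abstract does not describe the exact protocol, so the shuffling, the seed, and the function names `kfold_indices` and `accuracy` are illustrative assumptions.

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 23 samples is an arbitrary illustrative size, not the paper's dataset size.
splits = list(kfold_indices(23, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Cross-validation accuracy is the average of `accuracy` over the five held-out folds. An independent test set is never touched during model selection, which is one common reason the two figures can differ.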


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0244151
Author(s):  
Adam Joseph Ronald Pond ◽  
Seongwon Hwang ◽  
Berta Verd ◽  
Benjamin Steventon

Machine learning approaches are becoming increasingly widespread and are now present in most areas of research. Their recent surge can be explained in part by our ability to generate and store enormous amounts of data with which to train these models. The requirement for large training sets is also responsible for limiting further potential applications of machine learning, particularly in fields where data tend to be scarce, such as developmental biology. However, recent research seems to indicate that machine learning and Big Data can sometimes be decoupled to train models with modest amounts of data. In this work we set out to train a CNN-based classifier to stage zebrafish tail buds at four different stages of development using small information-rich data sets. Our results show that two- and three-dimensional convolutional neural networks can be trained to stage developing zebrafish tail buds based on both morphological and gene expression confocal microscopy images, achieving in each case up to 100% test accuracy scores. Importantly, we show that high accuracy can be achieved with data set sizes of under 100 images, much smaller than the typical training set size for a convolutional neural net. Furthermore, our classifier shows that it is possible to stage isolated embryonic structures without the need to refer to classic developmental landmarks in the whole embryo, which will be particularly useful to stage 3D in vitro culture systems such as organoids. We hope that this work will provide a proof of principle that will help dispel the myth that large data set sizes are always required to train CNNs, and encourage researchers in fields where data are scarce to also apply ML approaches.


Author(s):  
N. V. Mutovkin

Assessing the phase composition of the fluid in a well by analysing the frequencies of the radial resonance modes excited by acoustic noise in the inflow zone is a promising method for interpreting the results of passive noise metering. Machine learning makes it possible to take into account the many factors affecting the spectrum of the measured signal and to extract exactly those associated with a change in phase composition. In order to build the best model, machine learning approaches such as linear regression with different variants of regularisation, Bayesian regression, neural networks, support vector machines, decision trees, random forests and gradient boosting are considered. Data sets for training and testing the algorithm were obtained from scenarios calculated using a two-dimensional mathematical model with different values of the bed parameters and of the ratio of volume fractions of the fluids filling the well. The effect on the accuracy of the phase composition assessment of various factors, including the presence of the acoustic device housing, foreign noise in the signal and the shape of the signal spectrum, was checked. It is shown that, in the absence of data distortion, it is possible to build models that provide an absolute error in the assessment of the phase composition of about 1% after the zone of fluid inflow and about 5% in the zone before the inflow.


2020 ◽  
Author(s):  
Chi-Chou Liao ◽  
Liang-Jwu Chen ◽  
Shuen-Fang Lo ◽  
Chi-Wei Chen ◽  
Jia-Jyun Chen ◽  
...  

Background: T-DNA activation-tagging technology is widely used to enhance flanking gene expression near the site of insertion for functional genomics research in rice. However, whether the expression of a gene of interest is enhanced must be validated experimentally. Results: In this study, we built a model to predict gene expression in T-DNA mutants by machine learning approaches, thereby improving the efficiency of screening for activated genes. We gathered experimental datasets consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In first-layer models, SVM models were constructed with nine features consisting of information about biological function and local and global sequences. Feature encoding based on the PROMOTER sequence was weighted by logistic regression. The second-layer models integrated 16 first-layer models with mRMR feature selection and the LADTree algorithm, which were selected from nine feature selection methods and 65 classification methods, respectively. The accuracy of the final two-layer machine learning model, referred to as TIMgo, was 99.3% based on five-fold cross-validation and 85.6% based on independent testing. Conclusion: We discovered that the information within the local sequence had a greater contribution than the global sequence with respect to classification. TIMgo had a good predictive ability for target genes within 20 kb from the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the mechanism of activation of the 35S enhancer.


2019 ◽  
Vol 70 (3) ◽  
pp. 214-224
Author(s):  
Bui Ngoc Dung ◽  
Manh Dzung Lai ◽  
Tran Vu Hieu ◽  
Nguyen Binh T. H.

Video surveillance is an emerging research field in intelligent transport systems. This paper presents some techniques that use machine learning and computer vision for vehicle detection and tracking. First, machine learning approaches using Haar-like features and the AdaBoost algorithm for vehicle detection are presented. Second, approaches to detect vehicles using a background subtraction method based on the Gaussian Mixture Model, and to track vehicles using optical flow and multiple Kalman filters, are given. The method has the advantage of distinguishing and tracking multiple vehicles individually. The experimental results demonstrate the high accuracy of the method.
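
To make the Kalman filter tracking step concrete, here is a minimal sketch of a one-dimensional constant-velocity Kalman filter applied to a sequence of position measurements, as might come from a per-frame vehicle detector. It is not the paper's implementation: the state model, the noise parameters `q` and `r`, the initial covariance, and the function name `kalman_track` are all assumptions chosen for illustration.

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=1.0):
    """Track position and velocity from noisy 1D position measurements.

    State is (pos, vel); the covariance is the 2x2 matrix
    [[p00, p01], [p10, p11]]. q is an assumed process-noise variance added
    to the diagonal, r an assumed measurement-noise variance.
    """
    pos, vel = measurements[0], 0.0
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0  # assumed initial covariance
    estimates = []
    for z in measurements[1:]:
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos = pos + dt * vel
        a00 = p00 + dt * p10
        a01 = p01 + dt * p11
        p00, p01, p10, p11 = (a00 + dt * a01 + q, a01,
                              p10 + dt * p11, p11 + q)
        # Update with a position-only measurement, H = [1, 0].
        innovation = z - pos
        s = p00 + r
        k0, k1 = p00 / s, p10 / s
        pos += k0 * innovation
        vel += k1 * innovation
        p00, p01, p10, p11 = ((1 - k0) * p00, (1 - k0) * p01,
                              p10 - k1 * p00, p11 - k1 * p01)
        estimates.append((pos, vel))
    return estimates

# Synthetic, noise-free track of a vehicle moving 2 units per frame.
track = [2.0 * t for t in range(30)]
estimates = kalman_track(track)
final_pos, final_vel = estimates[-1]
print(round(final_pos, 2), round(final_vel, 2))
```

Tracking several vehicles at once, as the abstract describes, would typically mean running one such filter per vehicle and associating each new detection with the filter whose prediction is closest. That association step is omitted here.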


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similar to Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
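
The encoding step described above (a compound vector is the sum of its substructure vectors) can be sketched in a few lines. The embedding table below is entirely hypothetical: the identifiers `sub_a`, `sub_b`, `sub_c` and their 3-dimensional values are made up, whereas real Mol2vec embeddings are learned from a compound corpus and have many more dimensions. This does not use the Mol2vec package itself.

```python
import math

# Hypothetical toy embeddings keyed by made-up substructure identifiers.
substructure_vectors = {
    "sub_a": [0.2, 0.1, -0.3],
    "sub_b": [0.5, -0.2, 0.1],
    "sub_c": [-0.1, 0.4, 0.2],
}

def compound_vector(substructures, table):
    """Sum the vectors of a compound's substructures (repeats count each time)."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for name in substructures:
        for i, value in enumerate(table[name]):
            total[i] += value
    return total

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two made-up compounds described by the substructures they contain.
compound_1 = compound_vector(["sub_a", "sub_b", "sub_b"], substructure_vectors)
compound_2 = compound_vector(["sub_a", "sub_c"], substructure_vectors)
print(compound_1, cosine_similarity(compound_1, compound_2))
```

The resulting fixed-length vectors can then serve as input features for a supervised model, which is the use the abstract describes.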


2018 ◽  
Author(s):  
Sherif Tawfik ◽  
Olexandr Isayev ◽  
Catherine Stampfl ◽  
Joseph Shapter ◽  
David Winkler ◽  
...  

Materials constructed from different van der Waals two-dimensional (2D) heterostructures offer a wide range of benefits, but these systems have been little studied because of their experimental and computational complexity, and because of the very large number of possible combinations of 2D building blocks. The simulation of the interface between two different 2D materials is computationally challenging due to the lattice mismatch problem, which sometimes necessitates the creation of very large simulation cells for performing density-functional theory (DFT) calculations. Here we use a combination of DFT, linear regression and machine learning techniques in order to rapidly determine the interlayer distance between two different 2D heterostructures that are stacked in a bilayer heterostructure, as well as the band gap of the bilayer. Our work provides an excellent proof of concept by quickly and accurately predicting a structural property (the interlayer distance) and an electronic property (the band gap) for a large number of hybrid 2D materials. This work paves the way for rapid computational screening of the vast parameter space of van der Waals heterostructures to identify new hybrid materials with useful and interesting properties.

