Machine learning meets genome assembly

2018 ◽  
Vol 20 (6) ◽  
pp. 2116-2129 ◽  
Author(s):  
Kleber Padovani de Souza ◽  
João Carlos Setubal ◽  
André Carlos Ponce de Leon F. de Carvalho ◽  
Guilherme Oliveira ◽  
Annie Chateau ◽  
...  

Abstract
Motivation: With recent advances in DNA sequencing technologies, the study of the genetic composition of living organisms has become more accessible to researchers, enabling several advances, especially in the health sciences. However, many challenges arising from the complexity of sequencing projects remain unsolved. Among them is the task of assembling DNA fragments from previously unsequenced organisms, an NP-hard (nondeterministic polynomial time hard) problem for which no efficient computational solution with reasonable execution time is known. Nevertheless, several tools that produce approximate solutions have been used with results that have facilitated scientific discoveries, although there is ample room for improvement. As with other NP-hard problems, machine learning algorithms have been among the approaches explored in recent years to find better solutions to the DNA fragment assembly problem, though still on a small scale. Results: This paper presents a broad review of pioneering literature on artificial intelligence-based DNA assemblers, particularly those that use machine learning, to provide an overview of state-of-the-art approaches and to serve as a starting point for further study in this field.
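To make the assembly task concrete, the following toy sketch illustrates the overlap-based idea behind many approximate assemblers: greedily merge the pair of reads with the longest suffix-prefix overlap. This is an illustrative example only, not the method of any tool reviewed above.

```python
# Toy greedy fragment assembly: repeatedly merge the pair of reads with
# the longest suffix-prefix overlap until one contig remains.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list) -> str:
    reads = reads[:]
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = overlap(reads[i], reads[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0]

print(greedy_assemble(["ATTAGAC", "GACCTG", "CTGTTA"]))  # ATTAGACCTGTTA
```

The greedy strategy is a heuristic: it can merge reads in an order that blocks the globally best layout, which is one intuition for why exact assembly is NP-hard.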

Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 187
Author(s):  
Aaron Barbosa ◽  
Elijah Pelofske ◽  
Georg Hahn ◽  
Hristo N. Djidjev

Quantum annealers, such as the device built by D-Wave Systems, Inc., offer a way to compute solutions of NP-hard problems that can be expressed in Ising or quadratic unconstrained binary optimization (QUBO) form. Although such solutions are typically of very high quality, problem instances are usually not solved to optimality due to imperfections of the current generation of quantum annealers. In this contribution, we aim to understand some of the factors contributing to the hardness of a problem instance, and to use machine learning models to predict the accuracy of the D-Wave 2000Q annealer for solving specific problems. We focus on the maximum clique problem, a classic NP-hard problem with important applications in network analysis, bioinformatics, and computational chemistry. By training a machine learning classification model on basic problem characteristics, such as the number of edges in the graph, or annealing parameters, such as the D-Wave's chain strength, we are able to rank certain features in order of their contribution to the solution hardness, and we present a simple decision tree which allows one to predict whether a problem will be solvable to optimality with the D-Wave 2000Q. We extend these results by training a machine learning regression model that predicts the clique size found by D-Wave.
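The QUBO form mentioned above is easy to write down for maximum clique: reward each selected vertex and penalize every selected non-adjacent pair. The sketch below uses the standard formulation with an assumed penalty weight of 2, and brute force stands in for the annealer; it is not the paper's code.

```python
# Maximum clique as a QUBO: minimize H(x) = -sum_i x_i + 2 * sum over
# non-edges (i, j) of x_i * x_j. Any penalty > 1 makes infeasible picks
# (two non-adjacent vertices) more costly than the vertex reward.
import itertools
import numpy as np

def max_clique_qubo(n, edges):
    Q = -np.eye(n)                       # diagonal: reward for each vertex
    E = {frozenset(e) for e in edges}
    for i, j in itertools.combinations(range(n), 2):
        if frozenset((i, j)) not in E:
            Q[i, j] += 2.0               # penalty for a non-adjacent pair
    return Q

def brute_force_min(Q):
    """Exhaustive minimizer, standing in for the annealer on tiny instances."""
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Triangle {0,1,2} plus a pendant vertex 3: the maximum clique is {0,1,2}.
Q = max_clique_qubo(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
x, e = brute_force_min(Q)
print(x, e)  # [1 1 1 0] -3.0
```

The minimum energy equals minus the clique size, which is what the regression model described above learns to predict from instance features.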


Author(s):  
Hannah Bolinger ◽  
David Tran ◽  
Kenneth Harary ◽  
George C. Paoli ◽  
Giselle Guron ◽  
...  

Traditional microbiological testing methods are slow, and many molecular-based techniques rely on culture-based enrichment to overcome low limits of detection. Recent advancements in sequencing technologies may make it possible to utilize machine learning (ML) to identify patterns in microbiome data to potentially predict the presence or absence of pathogens. In this study, 299 poultry rinsate samples from various points in the processing chain were analyzed to determine if microbiota could inform about a sample's risk of containing Salmonella. Samples were culture confirmed as Salmonella-positive or -negative following modified USDA MLG protocols. The culture confirmation result was used as a reference to compare with 16S sequencing data. Pre-chill samples tested positive (71/82) at a higher frequency than post-chill samples (30/217) and contained greater microbial diversity. Due to their larger sample size, post-chill samples were analyzed more deeply. Analysis of variance (ANOVA) identified a significant effect of chilling on the number of genera (p<0.001), and analysis of similarities (ANOSIM) provided evidence for microbial dissimilarity between pre- and post-chill samples (p=0.001, R=0.443). Various ML models were trained using post-chill samples to predict if a sample contained Salmonella based on the samples' microbiota pre-enrichment. The optimal model was a Random Forest-based model with the following performance: accuracy (88%), sensitivity (85%), specificity (90%). While the algorithms described in this paper are prototypes, these risk-based algorithms demonstrate the potential and the need for further studies to provide insight alongside diagnostic tests. Combining risk-based information with diagnostic tools can help poultry processors make informed decisions to help identify and prevent the spread of Salmonella. These data add to the growing body of literature exploring novel ways to utilize microbiome data for predictive food safety.
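The risk-model idea above can be sketched in a few lines with scikit-learn. The data below are entirely synthetic stand-ins for genus-abundance profiles; feature dimensions and effect sizes are invented for illustration, and the printed accuracy is not comparable to the study's.

```python
# Minimal sketch: a random forest trained on (synthetic) microbiome
# abundance features to flag samples likely to contain a pathogen.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 300
# Invented "genus abundance" features, shifted upward for positive samples.
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + y[:, None] * 1.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(round(accuracy_score(y_te, clf.predict(X_te)), 2))
```

In practice the features would be 16S-derived genus counts (typically normalized or rarefied), and sensitivity/specificity would matter more than raw accuracy for a food-safety screen.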


Author(s):  
Lidong Wu

The No-Free-Lunch theorem is an interesting and important theoretical result in machine learning. Based on the philosophy of the No-Free-Lunch theorem, we discuss at length the limitations of data-driven approaches to solving NP-hard problems.
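The core No-Free-Lunch intuition can be demonstrated exhaustively on a tiny domain: averaged over all possible target functions, any fixed prediction rule does no better off the training set than its opposite. This toy enumeration (my own illustration, not from the paper) shows the tie exactly.

```python
# NFL toy: on a 3-point domain, train on points 0 and 1, predict point 2.
# Averaged over ALL 2^3 labelings, a learner and its negation tie at 0.5.
import itertools

DOMAIN_SIZE = 3  # three points; the first two form the training set

def learner_a(x, train):
    """Predicts the majority training label."""
    labels = [y for _, y in train]
    return int(sum(labels) >= len(labels) / 2)

def learner_b(x, train):
    """Always predicts the opposite of learner_a."""
    return 1 - learner_a(x, train)

def avg_ots_accuracy(learner):
    """Off-training-set accuracy averaged over every possible labeling."""
    total = 0
    for labeling in itertools.product([0, 1], repeat=DOMAIN_SIZE):
        train = [(0, labeling[0]), (1, labeling[1])]
        total += learner(2, train) == labeling[2]
    return total / 2 ** DOMAIN_SIZE

print(avg_ots_accuracy(learner_a), avg_ots_accuracy(learner_b))  # 0.5 0.5
```

Uniform averaging over all problem instances is exactly the assumption under which data-driven heuristics for NP-hard problems lose their edge; real instance distributions are what give them room to work.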


Author(s):  
Miss. Samiksha Arvind Kale ◽  
Prof. Dr. A. B. Gadicha

The heart plays a significant role in living organisms. Diagnosis and prediction of heart-related diseases require precision and correctness, because even a slight mistake can cause serious harm or the death of a person; deaths related to heart disease are numerous and their count is rising rapidly. To address this problem, there is an essential need for a prediction system that raises awareness about these diseases. Machine learning, a branch of artificial intelligence (AI), provides valuable support in predicting events by training on past observations. In this paper, we evaluate the accuracy of machine learning algorithms for predicting heart disease, namely k-nearest neighbors, decision tree, linear regression and support vector machine (SVM), using a UCI repository dataset for training and testing. For the Python implementation, the Anaconda (Jupyter) notebook is a suitable tool, offering many libraries that make the work more accurate and precise.
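A comparison of this kind is straightforward with scikit-learn. The sketch below uses synthetic data in place of the UCI dataset, and logistic regression stands in for the "linear regression" named above, as the usual classification analogue; the printed accuracies are illustrative only.

```python
# Train four classifiers on one shared split and report test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in; the UCI heart disease data has 13 features as well.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.2f}")
```

For medical data, a fixed train/test split like this should at minimum be replaced by cross-validation, and features should be scaled before k-NN and SVM.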


2022 ◽  
Vol 65 (1) ◽  
pp. 76-85
Author(s):  
Lance Fortnow

Advances in algorithms, machine learning, and hardware can help tackle many NP-hard problems once thought impossible.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Beatrix C. Hiesmayr

Abstract
Entanglement detection in high-dimensional systems is an NP-hard problem, since no efficient procedure for it is known. Given a bipartite quantum state of interest, free entanglement can be detected efficiently by the PPT criterion (Peres-Horodecki criterion), in contrast to bound entanglement, a curious form of entanglement that also cannot be distilled into maximally (free) entangled states. Only a few bound entangled states have been found, typically by constructing dedicated entanglement witnesses, so the question naturally arises: how large is the volume of such states? We define a large family of magically symmetric states of bipartite qutrits for which we find 82% to be free entangled, 2% to be certainly separable and as much as 10% to be bound entangled, which shows that this kind of entanglement is not rare. Via various machine learning algorithms we confirm that the remaining 6% of states are more likely to belong to the set of separable states than to the set of bound entangled states. Most importantly, we find via dimension-reduction algorithms that there is a strong two-dimensional (linear) sub-structure in the set of bound entangled states. This revealed structure opens a novel path to finding and characterizing bound entanglement, towards solving the long-standing problem of what the existence of bound entanglement implies.
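The PPT criterion mentioned above is a short computation: take the partial transpose of the density matrix and check for a negative eigenvalue. The sketch below uses two qubits for brevity (the paper studies qutrits, d=3, which the same function handles via its dimension parameter).

```python
# PPT (Peres-Horodecki) test: a state rho is PPT if rho^{T_B}, the
# partial transpose over the second subsystem, has no negative
# eigenvalue. An NPT state is necessarily (free) entangled; bound
# entangled states are PPT yet entangled, so PPT alone cannot find them.
import numpy as np

def partial_transpose(rho, d=2):
    # View the (d*d) x (d*d) matrix with indices (i, j, k, l) and swap
    # the second-subsystem indices j and l: (i, j, k, l) -> (i, l, k, j).
    return rho.reshape(d, d, d, d).transpose(0, 3, 2, 1).reshape(d * d, d * d)

def is_ppt(rho, d=2):
    return bool(np.min(np.linalg.eigvalsh(partial_transpose(rho, d))) >= -1e-12)

# Two-qubit singlet (|01> - |10>)/sqrt(2): NPT, hence free entangled.
psi = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)
rho = np.outer(psi, psi)
print(is_ppt(rho))                # False (entangled)
print(is_ppt(np.eye(4) / 4))      # True (maximally mixed, separable)
```

For qutrits, a PPT result is inconclusive: the state may be separable or bound entangled, which is why the study above needs witnesses and machine learning to separate those two sets.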


2021 ◽  
Vol 10 (5) ◽  
pp. 2857-2865
Author(s):  
Moanda Diana Pholo ◽  
Yskandar Hamam ◽  
Abdel Baset Khalaf ◽  
Chunling Du

Available literature reports several lymphoma cases misdiagnosed as tuberculosis, especially in countries with a heavy TB burden. This frequent misdiagnosis is due to the fact that the two diseases can present with similar symptoms. The present study therefore aims to analyse and explore TB as well as lymphoma case reports using Natural Language Processing tools and evaluate the use of machine learning to differentiate between the two diseases. As a starting point in the study, case reports were collected for each disease using web scraping. Natural language processing tools and text clustering were then used to explore the created dataset. Finally, six machine learning algorithms were trained and tested on the collected data, which contained 765 lymphoma and 546 tuberculosis case reports. Each method was evaluated using various performance metrics. The results indicated that the multi-layer perceptron model achieved the best accuracy (93.1%), recall (91.9%) and precision score (93.7%), thus outperforming other algorithms in terms of correctly classifying the different case reports.
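The pipeline described above (text features plus a multi-layer perceptron) can be sketched as follows. The toy documents are invented stand-ins for the scraped case reports, and TF-IDF is an assumed vectorization choice; the real study's preprocessing may differ.

```python
# TF-IDF vectorization + MLP classifier on toy "case report" snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "chronic cough night sweats positive sputum smear",
    "painless lymphadenopathy reed sternberg cells on biopsy",
    "weight loss cavitary lung lesion acid fast bacilli",
    "mediastinal mass b symptoms nodular sclerosis histology",
]
labels = ["tuberculosis", "lymphoma", "tuberculosis", "lymphoma"]

clf = make_pipeline(TfidfVectorizer(),
                    MLPClassifier(max_iter=2000, random_state=0))
clf.fit(docs, labels)
print(clf.predict(["night sweats and acid fast bacilli in sputum"]))
```

With only four documents this merely shows the plumbing; the study's 1,311 case reports and held-out evaluation are what make the reported 93.1% accuracy meaningful.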


2021 ◽  
Vol 1 (1) ◽  
pp. 146-176
Author(s):  
Israa Nadher ◽  
Mohammad Ayache ◽  
Hussein Kanaan

Abstract: Information decision support systems are becoming more widely used as we live in the era of digital data and the rise of artificial intelligence. Heart disease, as one of the best-known and most dangerous diseases, receives very important attention; this attention is translated into digital prediction systems that detect the presence of the disease from the available data and information. Such systems have faced many problems since they first appeared, but now, with the development of the machine learning field, we use them to build new models to detect the presence of this disease. Besides the algorithms, data is also central, forming the heart of prediction systems: prediction algorithms take decisions, those decisions must be based on facts, and those facts are extracted from data, so data is the starting point of every system. In this paper we propose a heart disease prediction system using machine learning algorithms. For the data we used the Cleveland dataset; this dataset is normalized and then divided into three training/testing scenarios: 80%-20%, 50%-50% and 30%-70%. These three scenarios apply whether the dataset is normalized or not. For each scenario we used three machine learning algorithms, namely SVM, SMO and MLP, and for these algorithms we used two different kernels to compare the results. These kernel variants are added to the collection of scenarios above, so that at the top level we have two types, normalized and unnormalized dataset; for each we have three variants according to the training/testing proportions; and for each of these we have two variants according to the type of kernel, giving 30 scenarios in total. Our proposed system has shown a dominance in accuracy over previous works.
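The scenario grid described above ({normalized, raw} x {80/20, 50/50, 30/70} x two kernels) can be enumerated mechanically. In this sketch, synthetic data replaces the Cleveland dataset and the MLP is omitted; SMO is the solver LibSVM (and hence scikit-learn's SVC) already uses internally, so SVC stands in for both SVM variants here.

```python
# Enumerate {normalization} x {split} x {kernel} scenarios for an SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; the Cleveland dataset has 13 clinical features.
X_raw, y = make_classification(n_samples=300, n_features=13, random_state=0)

results = {}
for norm in (False, True):
    X = StandardScaler().fit_transform(X_raw) if norm else X_raw
    for test_size in (0.2, 0.5, 0.7):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=0)
        for kernel in ("linear", "rbf"):
            acc = SVC(kernel=kernel).fit(X_tr, y_tr).score(X_te, y_te)
            results[(norm, test_size, kernel)] = acc

for key, acc in sorted(results.items()):
    print(key, round(acc, 2))
```

Running the grid this way makes the comparison reproducible: every scenario shares one random seed, so accuracy differences reflect normalization, split size and kernel rather than sampling noise.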

