Knowledge-guided analysis of ‘omics’ data using the KnowEnG cloud platform

Mapping Intimacies ◽

10.1101/642124 ◽

2019 ◽

Cited By ~ 2

Author(s):

Charles Blatti ◽

Amin Emad ◽

Matthew J. Berry ◽

Lisa Gatzke ◽

Milt Epstein ◽

...

Keyword(s):

Cost Effective ◽

Knowledge Bases ◽

Machine Learning Algorithms ◽

Knowledge Network ◽

Data Sets ◽

Cancer Data ◽

Data Intensive ◽

Fair Principles ◽

Computing Environments ◽

Computing Platforms

AbstractWe present KnowEnG, a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis and expression signature analysis. The system offers ‘knowledge-guided’ data-mining and machine learning algorithms, where user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge-bases and encoded in a massive ‘Knowledge Network’. KnowEnG adheres to ‘FAIR’ principles: its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution of compute-intensive and data-intensive algorithms, and are interoperable with other computing platforms. They are made available through multiple access modes including a web-portal, and include specialized visualization modules. We present use cases and re-analysis of published cancer data sets using KnowEnG tools and demonstrate its potential value in democratization of advanced tools for the modern genomics era.

Download Full-text

A Comparison Study of Data Mining Algorithms for blood Cancer Prediction

passer ◽

10.24271/psr.29 ◽

2019 ◽

Vol 3 (1) ◽

pp. 174-179

Author(s):

Noor Bahjat ◽

Snwr Jamak

Keyword(s):

Data Mining ◽

Machine Learning Algorithms ◽

Common Disease ◽

Data Sets ◽

Blood Cancer ◽

Cancer Data ◽

Data Mining Algorithms ◽

Detection And Diagnosis ◽

Mining Algorithms ◽

Early Detection And Diagnosis

Cancer is a common disease that threats the life of one of every three people. This dangerous disease urgently requires early detection and diagnosis. The recent progress in data mining methods, such as classification, has proven the need for machine learning algorithms to apply to large datasets. This paper mainly aims to utilise data mining techniques to classify cancer data sets into blood cancer and non-blood cancer based on pre-defined information and post-defined information obtained after blood tests and CT scan tests. This research conducted using the WEKA data mining tool with 10-fold cross-validation to evaluate and compare different classification algorithms, extract meaningful information from the dataset and accurately identify the most suitable and predictive model. This paper depicted that the most suitable classifier with the best ability to predict the cancerous dataset is Multilayer perceptron with an accuracy of 99.3967%.

Download Full-text

A Study on Machine Learning: Elements, Characteristics and Algorithms

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.19.13793 ◽

2018 ◽

Vol 7 (2.19) ◽

pp. 31

Author(s):

K Chokkanathan ◽

S Koteeswaran

Keyword(s):

Machine Learning ◽

Programming Languages ◽

Real Time ◽

Learning Algorithms ◽

Cost Effective ◽

Machine Learning Algorithms ◽

Data Sets ◽

Real Time Traffic ◽

Sample Data ◽

Spam Filters

Machine learning algorithms are used immensely for performing most important computational tasks with the help of sample data sets. Most of the cases Machine learning algorithms will provide best solution where the programming languages failed to produce viable and economically cost-effective results. Huge volume of deterministic problems are addressed and tackled by using the available sample data sets. Because of this now a days machine learning concepts are extensively used in computer science and many other fields. But still we need to explore more to implement machine learning in a specific field such as network analysis, stock trading, spam filters, traffic analysis, real-time and non-real time traffic etc., which may not be available in text books. Here I would like to discourse some of the key points that the machine learning researchers and practitioners can make use of them. These include shortcomings and concerns also.

Download Full-text

Computational Approaches in Theranostics: Mining and Predicting Cancer Data

Pharmaceutics ◽

10.3390/pharmaceutics11030119 ◽

2019 ◽

Vol 11 (3) ◽

pp. 119 ◽

Cited By ~ 7

Author(s):

Tânia F. G. G. Cova ◽

Daniel J. Bento ◽

Sandra C. C. Nunes

Keyword(s):

Computational Models ◽

Predictive Analytics ◽

Computational Modelling ◽

Cost Effective ◽

Computational Approaches ◽

Design And Optimization ◽

Cancer Data ◽

Data Intensive ◽

Related Data ◽

Molecular Features

The ability to understand the complexity of cancer-related data has been prompted by the applications of (1) computer and data sciences, including data mining, predictive analytics, machine learning, and artificial intelligence, and (2) advances in imaging technology and probe development. Computational modelling and simulation are systematic and cost-effective tools able to identify important temporal/spatial patterns (and relationships), characterize distinct molecular features of cancer states, and address other relevant aspects, including tumor detection and heterogeneity, progression and metastasis, and drug resistance. These approaches have provided invaluable insights for improving the experimental design of therapeutic delivery systems and for increasing the translational value of the results obtained from early and preclinical studies. The big question is: Could cancer theranostics be determined and controlled in silico? This review describes the recent progress in the development of computational models and methods used to facilitate research on the molecular basis of cancer and on the respective diagnosis and optimized treatment, with particular emphasis on the design and optimization of theranostic systems. The current role of computational approaches is providing innovative, incremental, and complementary data-driven solutions for the prediction, simplification, and characterization of cancer and intrinsic mechanisms, and to promote new data-intensive, accurate diagnostics and therapeutics.

Download Full-text

Robust Distance Measures for kNN Classification of Cancer Data

Cancer Informatics ◽

10.1177/1176935120965542 ◽

2020 ◽

Vol 19 ◽

pp. 117693512096554

Author(s):

Rezvan Ehsani ◽

Finn Drabløs

Keyword(s):

Nearest Neighbor ◽

Machine Learning Algorithms ◽

Distance Measures ◽

Data Sets ◽

Biomedical Data ◽

Similar Data ◽

The Novel ◽

K Nearest Neighbor ◽

Cancer Data

The k-Nearest Neighbor ( kNN) classifier represents a simple and very general approach to classification. Still, the performance of kNN classifiers can often compete with more complex machine-learning algorithms. The core of kNN depends on a “guilt by association” principle where classification is performed by measuring the similarity between a query and a set of training patterns, often computed as distances. The relative performance of kNN classifiers is closely linked to the choice of distance or similarity measure, and it is therefore relevant to investigate the effect of using different distance measures when comparing biomedical data. In this study on classification of cancer data sets, we have used both common and novel distance measures, including the novel distance measures Sobolev and Fisher, and we have evaluated the performance of kNN with these distances on 4 cancer data sets of different type. We find that the performance when using the novel distance measures is comparable to the performance with more well-established measures, in particular for the Sobolev distance. We define a robust ranking of all the distance measures according to overall performance. Several distance measures show robust performance in kNN over several data sets, in particular the Hassanat, Sobolev, and Manhattan measures. Some of the other measures show good performance on selected data sets but seem to be more sensitive to the nature of the classification data. It is therefore important to benchmark distance measures on similar data prior to classification to identify the most suitable measure in each case.

Download Full-text

ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

10.1101/741843 ◽

2019 ◽

Cited By ~ 1

Author(s):

Tanveer Ahmad ◽

Nauman Ahmed ◽

Johan Peltenburg ◽

Zaid Al-Ars

Keyword(s):

Cost Effective ◽

Data Representation ◽

Data Sets ◽

Memory Processing ◽

Physical Constraints ◽

Language Interoperability ◽

Sequencing Technologies ◽

Fast Processing ◽

Computing Platforms ◽

Cross Language

AbstractThe rapidly growing size of genomics data bases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large amounts of working data sets in memory. However, new emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit of in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)-serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.

Download Full-text

Emperical Evaluation of Machine Learning algorithms for Breast Cancer Data Classification

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i10.346351 ◽

2018 ◽

Vol 6 (10) ◽

pp. 346-351

Author(s):

S. Kumaravel ◽

S. Ophilia Domanica Vithya

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Learning Algorithms ◽

Data Classification ◽

Machine Learning Algorithms ◽

Breast Cancer Data ◽

Cancer Data

Download Full-text

A cost-effective power-aware approach for scheduling cloudlets in cloud computing environments

The Journal of Supercomputing ◽

10.1007/s11227-021-03894-2 ◽

2021 ◽

Author(s):

Minhaj Ahmad Khan

Keyword(s):

Cloud Computing ◽

Cost Effective ◽

Effective Power ◽

Computing Environments

Download Full-text

Using Machine Learning Methods to Identify Particle Types from Doppler Lidar Measurements in Iceland

Remote Sensing ◽

10.3390/rs13132433 ◽

2021 ◽

Vol 13 (13) ◽

pp. 2433

Author(s):

Shu Yang ◽

Fengchao Peng ◽

Sibylle von Löwis ◽

Guðrún Nína Petersen ◽

David Christian Finger

Keyword(s):

Machine Learning ◽

Weather Conditions ◽

Dust Storms ◽

Machine Learning Algorithms ◽

Lidar Data ◽

Data Sets ◽

Doppler Lidar ◽

Lidar Measurements ◽

Using Data ◽

Filter Noise

Doppler lidars are used worldwide for wind monitoring and recently also for the detection of aerosols. Automatic algorithms that classify the lidar signals retrieved from lidar measurements are very useful for the users. In this study, we explore the value of machine learning to classify backscattered signals from Doppler lidars using data from Iceland. We combined supervised and unsupervised machine learning algorithms with conventional lidar data processing methods and trained two models to filter noise signals and classify Doppler lidar observations into different classes, including clouds, aerosols and rain. The results reveal a high accuracy for noise identification and aerosols and clouds classification. However, precipitation detection is underestimated. The method was tested on data sets from two instruments during different weather conditions, including three dust storms during the summer of 2019. Our results reveal that this method can provide an efficient, accurate and real-time classification of lidar measurements. Accordingly, we conclude that machine learning can open new opportunities for lidar data end-users, such as aviation safety operators, to monitor dust in the vicinity of airports.

Download Full-text

A novel framework for designing a multi-DoF prosthetic wrist control using machine learning

Scientific Reports ◽

10.1038/s41598-021-94449-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chinmay P. Swami ◽

Nicholas Lenhard ◽

Jiyeon Kang

Keyword(s):

Machine Learning ◽

Random Forest ◽

Upper Limb ◽

Daily Living ◽

Machine Learning Algorithms ◽

Data Sets ◽

Random Forest Regression ◽

Prosthetic Devices ◽

Upper Limb Function ◽

The Neural Network

AbstractProsthetic arms can significantly increase the upper limb function of individuals with upper limb loss, however despite the development of various multi-DoF prosthetic arms the rate of prosthesis abandonment is still high. One of the major challenges is to design a multi-DoF controller that has high precision, robustness, and intuitiveness for daily use. The present study demonstrates a novel framework for developing a controller leveraging machine learning algorithms and movement synergies to implement natural control of a 2-DoF prosthetic wrist for activities of daily living (ADL). The data was collected during ADL tasks of ten individuals with a wrist brace emulating the absence of wrist function. Using this data, the neural network classifies the movement and then random forest regression computes the desired velocity of the prosthetic wrist. The models were trained/tested with ADLs where their robustness was tested using cross-validation and holdout data sets. The proposed framework demonstrated high accuracy (F-1 score of 99% for the classifier and Pearson’s correlation of 0.98 for the regression). Additionally, the interpretable nature of random forest regression was used to verify the targeted movement synergies. The present work provides a novel and effective framework to develop an intuitive control for multi-DoF prosthetic devices.

Download Full-text

Fundamental resource trade-offs for encoded distributed optimization

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Abstract Dealing with the shear size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes a.k.a. stragglers can significantly slow down computation as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop novel mathematical understanding for this framework demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.

Download Full-text