A learning-based framework for miRNA-disease association identification using neural networks

10.1101/276048 ◽

2018 ◽

Cited By ~ 9

Author(s):

Jiajie Peng ◽

Weiwei Hui ◽

Qianqian Li ◽

Bolin Chen ◽

Qinghua Jiang ◽

...

Keyword(s):

State Of The Art ◽

Essential Feature ◽

Source Code ◽

Disease Association ◽

Feature Representation ◽

Supplementary Information ◽

Feature Combination ◽

Biological Processes ◽

Non Coding Rna ◽

Supplementary Material

Abstract Motivation A microRNA (miRNA) is a type of non-coding RNA, which plays important roles in many biological processes. Lots of studies have shown that miRNAs are implicated in human diseases, indicating that miRNAs might be potential biomarkers for various types of diseases. Therefore, it is important to reveal the relationships between miRNAs and diseases/phenotypes. Results We propose a novel learning-based framework, MDA-CNN, for miRNA-disease association identification. The model first captures richer interaction features between diseases and miRNAs based on a three-layer network with an additional gene layer. Then, it employs an auto-encoder to identify the essential feature combination for each pair of miRNA and disease automatically. Finally, taking the reduced feature representation as input, it uses a convolutional neural network to predict the final label. The evaluation results show that the proposed framework outperforms some state-of-the-art approaches in a large margin on both tasks of miRNA-disease association prediction and miRNA-phenotype association prediction. Availability The source code and data are available at https://github.com/Issingjessica/[email protected];[email protected];[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

A learning-based framework for miRNA-disease association identification using neural networks

Bioinformatics ◽

10.1093/bioinformatics/btz254 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4364-4371 ◽

Cited By ~ 35

Author(s):

Jiajie Peng ◽

Weiwei Hui ◽

Qianqian Li ◽

Bolin Chen ◽

Jianye Hao ◽

...

Keyword(s):

Essential Feature ◽

Interaction Network ◽

Disease Association ◽

Feature Representation ◽

Supplementary Information ◽

Similarity Network ◽

Protein Protein Interaction ◽

Non Coding Rna ◽

Disease Similarity ◽

Protein Protein Interaction Network

Abstract Motivation A microRNA (miRNA) is a type of non-coding RNA, which plays important roles in many biological processes. Lots of studies have shown that miRNAs are implicated in human diseases, indicating that miRNAs might be potential biomarkers for various types of diseases. Therefore, it is important to reveal the relationships between miRNAs and diseases/phenotypes. Results We propose a novel learning-based framework, MDA-CNN, for miRNA-disease association identification. The model first captures interaction features between diseases and miRNAs based on a three-layer network including disease similarity network, miRNA similarity network and protein-protein interaction network. Then, it employs an auto-encoder to identify the essential feature combination for each pair of miRNA and disease automatically. Finally, taking the reduced feature representation as input, it uses a convolutional neural network to predict the final label. The evaluation results show that the proposed framework outperforms some state-of-the-art approaches in a large margin on both tasks of miRNA-disease association prediction and miRNA-phenotype association prediction. Availability and implementation The source code and data are available at https://github.com/Issingjessica/MDA-CNN. Supplementary information Supplementary data are available at Bioinformatics online.

HOPS: high-performance library for (non-)uniform sampling of convex-constrained models

Bioinformatics ◽

10.1093/bioinformatics/btaa872 ◽

2020 ◽

Author(s):

Johann F Jadebeck ◽

Axel Theorell ◽

Samuel Leweke ◽

Katharina Nöh

Keyword(s):

High Performance ◽

State Of The Art ◽

Source Code ◽

Third Party ◽

Supplementary Information ◽

Scalable Algorithms ◽

Uniform Sampling ◽

Non Uniform Sampling ◽

Constrained Models ◽

Performance Gains

Abstract Summary The C++ library Highly Optimized Polytope Sampling (HOPS) provides implementations of efficient and scalable algorithms for sampling convex-constrained models that are equipped with arbitrary target functions. For uniform sampling, substantial performance gains were achieved compared to the state-of-the-art. The ease of integration and utility of non-uniform sampling is showcased in a Bayesian inference setting, demonstrating how HOPS interoperates with third-party software. Availability and implementation Source code is available at https://github.com/modsim/hops/, tested on Linux and MS Windows, includes unit tests, detailed documentation, example applications and a Dockerfile. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Temporal network alignment via GoT-WAVE

Bioinformatics ◽

10.1093/bioinformatics/btz119 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3527-3529 ◽

Cited By ~ 3

Author(s):

David Aparício ◽

Pedro Ribeiro ◽

Tijana Milenković ◽

Fernando Silva

Keyword(s):

User Interface ◽

State Of The Art ◽

Source Code ◽

Network Alignment ◽

Supplementary Information ◽

Temporal Network ◽

Temporal Networks ◽

Supplementary Data ◽

Node Similarity ◽

User Friendly

Abstract Motivation Network alignment (NA) finds conserved regions between two networks. NA methods optimize node conservation (NC) and edge conservation. Dynamic graphlet degree vectors are a state-of-the-art dynamic NC measure, used within the fastest and most accurate NA method for temporal networks: DynaWAVE. Here, we use graphlet-orbit transitions (GoTs), a different graphlet-based measure of temporal node similarity, as a new dynamic NC measure within DynaWAVE, resulting in GoT-WAVE. Results On synthetic networks, GoT-WAVE improves DynaWAVE’s accuracy by 30% and speed by 64%. On real networks, when optimizing only dynamic NC, the methods are complementary. Furthermore, only GoT-WAVE supports directed edges. Hence, GoT-WAVE is a promising new temporal NA algorithm, which efficiently optimizes dynamic NC. We provide a user-friendly user interface and source code for GoT-WAVE. Availability and implementation http://www.dcc.fc.up.pt/got-wave/ Supplementary information Supplementary data are available at Bioinformatics online.

Faucet: streaming de novo assembly graph construction

10.1101/125658 ◽

2017 ◽

Author(s):

Roye Rozov ◽

Gil Goldshlager ◽

Eran Halperin ◽

Ron Shamir

Keyword(s):

Resource Use ◽

De Novo ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Assembly Quality ◽

Metagenome Assembly ◽

Streaming Algorithm ◽

Supplementary Material ◽

De Bruijn

Abstract Motivation We present Faucet, a 2-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased. Results Faucet pairs the de Bruijn graph obtained from the reads with additional meta-data derived from them. We show these metadata - coverage counts collected at junction k-mers and connections bridging between junction pairs - contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet’s resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency - namely, Minia and LightAssembler. However, on metagenomes tested, Faucet’s outputs had 14-110% higher mean NGA50 lengths compared to Minia, and 2-11-fold higher mean NGA50 lengths compared to LightAssembler, the only other streaming assembler available. Availability Faucet is available at https://github.com/Shamir-Lab/[email protected],[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

PATO: Pangenome Analysis Toolkit

10.1101/2021.01.30.428878 ◽

2021 ◽

Author(s):

Miguel D. Fernández-de-Bobadilla ◽

Alba Talavera-Rodríguez ◽

Lucía Chacón ◽

Fernando Baquero ◽

Teresa M. Coque ◽

...

Keyword(s):

Population Structure ◽

Statistical Analysis ◽

Core Genome ◽

State Of The Art ◽

Source Code ◽

Supplementary Information ◽

Complete Analysis ◽

Large Set ◽

Supplementary Data ◽

Desktop Computer

Abstract Motivation Comparative genomics is a growing field but one that will be eventually overtaken by sample size studies and the increase of available genomes in public databases. We present the Pangenome Analysis Toolkit (PATO) designed to simultaneously analyze thousands of genomes using a desktop computer. The tool performs common tasks of pangenome analysis such as core-genome definition and accessory genome properties and includes new features that help characterize population structure, annotate pathogenic features and create gene sharedness networks. PATO has been developed in R to integrate with the large set of tools available for genetic, phylogenetic and statistical analysis in this environment. Results PATO can perform the most demanding bioinformatic analyses in minutes with an accuracy comparable to state-of-the-art software but 20–30x times faster. PATO also integrates all the necessary functions for the complete analysis of the most common objectives in microbiology studies. Lastly, PATO includes the necessary tools for visualizing the results and can be integrated with other analytical packages available in R. Availability The source code for PATO is freely available at https://github.com/irycisBioinfo/PATO under the GPLv3 [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

GlycanFormatConverter: a conversion tool for translating the complexities of glycans

Bioinformatics ◽

10.1093/bioinformatics/bty990 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2434-2440 ◽

Cited By ~ 7

Author(s):

Shinichiro Tsuchiya ◽

Issaku Yamada ◽

Kiyoko F Aoki-Kinoshita

Keyword(s):

Open Source ◽

Source Code ◽

Supplementary Information ◽

Biological Processes ◽

Supplementary Data ◽

Unique Representation ◽

Open Source Tool ◽

Living Organisms ◽

Conversion Tool ◽

Complex Glycan

Abstract Motivation Glycans are biomolecules that take an important role in the biological processes of living organisms. They form diverse, complicated structures such as branched and cyclic forms. Web3 Unique Representation of Carbohydrate Structures (WURCS) was proposed as a new linear notation for uniquely representing glycans during the GlyTouCan project. WURCS defines rules for complex glycan structures that other text formats did not support, and so it is possible to represent a wide variety glycans. However, WURCS uses a complicated nomenclature, so it is not human-readable. Therefore, we aimed to support the interpretation of WURCS by converting WURCS to the most basic and widely used format IUPAC. Results In this study, we developed GlycanFormatConverter and succeeded in converting WURCS to the three kinds of IUPAC formats (IUPAC-Extended, IUPAC-Condensed and IUPAC-Short). Furthermore, we have implemented functionality to import IUPAC-Extended, KEGG Chemical Function (KCF) and LinearCode formats and to export WURCS. We have thoroughly tested our GlycanFormatConverter and were able to show that it was possible to convert all the glycans registered in the GlyTouCan repository, with exceptions owing only to the limitations of the original format. The source code for this conversion tool has been released as an open source tool. Availability and implementation https://github.com/glycoinfo/GlycanFormatConverter.git Supplementary information Supplementary data are available at Bioinformatics online.

FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

10.1101/2021.01.13.426489 ◽

2021 ◽

Author(s):

Jiahua Rao ◽

Shuangjia Zheng ◽

Ying Song ◽

Jianwen Chen ◽

Chengtao Li ◽

...

Keyword(s):

State Of The Art ◽

Source Code ◽

Representation Learning ◽

Supplementary Information ◽

Data Sets ◽

Supplementary Data ◽

Property Prediction ◽

Average Rank ◽

Benchmark Data ◽

Classification Tasks

Abstract Summary Recently, novel representation learning algorithms have shown potential for predicting molecular properties. However, unified frameworks have not yet emerged for fairly measuring algorithmic progress, and experimental procedures of different representation models often lack rigorousness and are hardly reproducible. Herein, we have developed MolRep by unifying 16 state-of-the-art models across 4 popular molecular representations for application and comparison. Furthermore, we ran more than 12.5 million experiments to optimize hyperparameters for each method on 12 common benchmark data sets. As a result, CMPNN achieves the best results ranked the 1st in 5 out of 12 tasks with an average rank of 1.75. Relatively, ECC has good performance in classification tasks and MAT good for regression (both ranked 1st for 3 tasks) with an average rank of 2.71 and 2.6, respectively. Availability The source code is available at: https://github.com/biomed-AI/MolRep Supplementary information Supplementary data are available online.

Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping

Bioinformatics ◽

10.1093/bioinformatics/btaa112 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3254-3256 ◽

Cited By ~ 2

Author(s):

Hang Dai ◽

Yongtao Guan

Keyword(s):

Hash Function ◽

Reference Genome ◽

State Of The Art ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Efficient Tool ◽

Cpu Time ◽

Products Of Matrices

Abstract Summary We present Nubeam-dedup, a fast and RAM-efficient tool to de-duplicate sequencing reads without reference genome. Nubeam-dedup represents nucleotides by matrices, transforms reads into products of matrices, and based on which assigns a unique number to a read. Thus, duplicate reads can be efficiently removed by using a collisionless hash function. Compared with other state-of-the-art reference-free tools, Nubeam-dedup uses 50–70% of CPU time and 10–15% of RAM. Availability and implementation Source code in C++ and manual are available at https://github.com/daihang16/nubeamdedup and https://haplotype.org. Supplementary information Supplementary data are available at Bioinformatics online.
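The idea of turning a read into a product of matrices and using the result as a collision-free key can be illustrated with a minimal Python sketch. This is not the Nubeam-dedup implementation: the generator matrices `L` and `R`, the two-generator encoding of each nucleotide, and the function names `read_key` and `dedup` are assumptions chosen for illustration. The sketch relies on the known fact that these two matrices generate a free monoid, so distinct reads over A, C, G, T give distinct products. It uses the product matrix itself as the key instead of deriving a single number from it, and it assumes reads contain only A, C, G and T.

```python
# Hypothetical generators (not taken from Nubeam-dedup): L and R freely
# generate a monoid of 2x2 integer matrices, so distinct words over {L, R}
# always yield distinct products.
L = ((1, 0), (1, 1))
R = ((1, 1), (0, 1))
IDENTITY = ((1, 0), (0, 1))


def matmul(a, b):
    """Multiply two 2x2 matrices stored as nested tuples of Python ints."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


# Assumed encoding: each nucleotide is a distinct length-2 word over {L, R}.
NUCLEOTIDE = {
    "A": matmul(L, L),
    "C": matmul(L, R),
    "G": matmul(R, L),
    "T": matmul(R, R),
}


def read_key(read):
    """Return the product of the nucleotide matrices, in read order."""
    product = IDENTITY
    for base in read:
        product = matmul(product, NUCLEOTIDE[base])
    return product


def dedup(reads):
    """Keep the first occurrence of each read, comparing matrix keys only."""
    seen = set()
    unique = []
    for read in reads:
        key = read_key(read)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique
```

For example, `dedup(["ACGT", "ACGT", "TGCA", "ACG"])` keeps one copy of `"ACGT"` and preserves input order. Because matrix multiplication does not commute, reordered reads such as `"AC"` and `"CA"` receive different keys. Matrix entries grow with read length, which Python's arbitrary-precision integers absorb; a production tool would need a more compact representation.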

CoCoNet: an efficient deep learning tool for viral metagenome binning

Bioinformatics ◽

10.1093/bioinformatics/btab213 ◽

2021 ◽

Author(s):

Cédric G Arisdakessian ◽

Olivia D Nigro ◽

Grieg F Steward ◽

Guylaine Poisson ◽

Mahdi Belcaid

Keyword(s):

Deep Learning ◽

Viral Genome ◽

High Performance ◽

Source Code ◽

Supplementary Information ◽

Biological Processes ◽

Bacterial Genomes ◽

Large Dataset ◽

Sequence Composition ◽

Rigorous Framework

Abstract Motivation Metagenomic approaches hold the potential to characterize microbial communities and unravel the intricate link between the microbiome and biological processes. Assembly is one of the most critical steps in metagenomics experiments. It consists of transforming overlapping DNA sequencing reads into sufficiently accurate representations of the community’s genomes. This process is computationally difficult and commonly results in genomes fragmented across many contigs. Computational binning methods are used to mitigate fragmentation by partitioning contigs based on their sequence composition, abundance or chromosome organization into bins representing the community’s genomes. Existing binning methods have been principally tuned for bacterial genomes and do not perform favorably on viral metagenomes. Results We propose Composition and Coverage Network (CoCoNet), a new binning method for viral metagenomes that leverages the flexibility and the effectiveness of deep learning to model the co-occurrence of contigs belonging to the same viral genome and provide a rigorous framework for binning viral contigs. Our results show that CoCoNet substantially outperforms existing binning methods on viral datasets. Availability and implementation CoCoNet was implemented in Python and is available for download on PyPi (https://pypi.org/). The source code is hosted on GitHub at https://github.com/Puumanamana/CoCoNet and the documentation is available at https://coconet.readthedocs.io/en/latest/index.html. CoCoNet does not require extensive resources to run. For example, binning 100k contigs took about 4 h on 10 Intel CPU Cores (2.4 GHz), with a memory peak at 27 GB (see Supplementary Fig. S9). To process a large dataset, CoCoNet may need to be run on a high RAM capacity server. Such servers are typically available in high-performance or cloud computing settings. Supplementary information Supplementary data are available at Bioinformatics online.
