SOM-Based Class Discovery Exploring the ICA-Reduced Features of Microarray Expression Profiles

Andrei Dragomir; Seferina Mavroudi; Anastasios Bezerianos

doi:10.1002/cfg.444

Comparative and Functional Genomics ◽

10.1002/cfg.444 ◽

2004 ◽

Vol 5 (8) ◽

pp. 596-616 ◽

Cited By ~ 2

Author(s):

Andrei Dragomir ◽

Seferina Mavroudi ◽

Anastasios Bezerianos

Keyword(s):

Clustering Algorithm ◽

Learning Algorithm ◽

Expression Profiles ◽

Relevant Information ◽

Statistical Dependence ◽

Analysis Tool ◽

Self Organizing Map ◽

Biologically Relevant ◽

Class Discovery ◽

The Cost

Gene expression datasets are large and complex, with many variables and unknown internal structure. We apply independent component analysis (ICA) to derive a less redundant representation of the expression data. The decomposition produces components with minimal statistical dependence and reveals biologically relevant information. We then apply cluster analysis to the transformed data; clustering is an important and popular tool for gaining an initial understanding of data and is usually employed for class discovery. The proposed self-organizing map (SOM)-based clustering algorithm automatically determines the number of ‘natural’ subgroups in the data, aided in this task by the available prior knowledge of the functional categories of genes. An entropy criterion allows each gene to be assigned to multiple classes, which better reflects the underlying biology. These features do not come at the cost of the algorithm's simplicity: the map grows on a simple grid structure and the learning algorithm remains the same as Kohonen’s.
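
The Kohonen learning rule that the abstract refers to can be sketched as follows. This is a minimal illustration on a fixed rectangular grid, not the authors' algorithm: the growing map, the entropy criterion, and the ICA preprocessing are all omitted, and the function names, decay schedules, and default parameters are assumptions chosen for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((wk - xk) ** 2 for wk, xk in zip(weights[node], x)),
    )


def som_step(weights, x, lr, sigma):
    """Apply one Kohonen update in place and return the winning node."""
    bmu = best_matching_unit(weights, x)
    for (r, c), w in weights.items():
        # Gaussian neighbourhood measured on the grid, not in data space.
        grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
        h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
        for k in range(len(w)):
            w[k] += lr * h * (x[k] - w[k])
    return bmu


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a fixed-size SOM; linear decay of lr and sigma is an assumption."""
    rng = random.Random(seed)
    # Initialise each node with a copy of a randomly chosen data point.
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.1)
            som_step(weights, x, lr, sigma)
            step += 1
    return weights
```

Calling `train_som(data, rows, cols)` on a list of equal-length feature vectors returns a dictionary that maps grid coordinates to weight vectors; each input can then be assigned to a cluster with `best_matching_unit`.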

Time series clustering in large data sets

Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis ◽

10.11118/actaun201159020075 ◽

2011 ◽

Vol 59 (2) ◽

pp. 75-80 ◽

Cited By ~ 4

Author(s):

Jiří Fejfar ◽

Jiří Šťastný

Keyword(s):

Time Series ◽

Digital Libraries ◽

Clustering Algorithm ◽

Learning Algorithm ◽

Large Data ◽

Data Sets ◽

Self Organizing Map ◽

Time Series Clustering ◽

Feature Vectors ◽

Cover Songs

The clustering of time series is a widely researched area, and there are many methods for dealing with this task. We use the self-organizing map (SOM) with an unsupervised learning algorithm to cluster time series. After the first experiment (Fejfar, Weinlichová, Šťastný, 2009), the overall concept of the clustering algorithm appears to be correct, but time series clustering has to be performed on a much larger dataset to obtain more accurate results and to determine more precisely the correlation between the configured parameters and the results. A second requirement is a well-defined evaluation of the results. It again seems useful to use sound recordings as instances of time series: digital libraries offer many recordings, and many interesting features and patterns can be found in this area. In this experiment we search for recordings with a similar development of information density. This can be used for musical form investigation, cover song detection and many other applications. The objective of the paper is to compare clustering results obtained with different parameters of the feature vectors and of the SOM itself. We describe time series in a simple way, by evaluating standard deviations for separate parts of the recordings. The resulting feature vectors are clustered with the SOM in batch training mode with different topologies, ranging from a few neurons to large maps. Other algorithms usable for finding similarities between time series are discussed, and conclusions for further research are presented. We also present an overview of the current related literature and projects.
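
The feature description used in this abstract (standard deviations of separate parts of a recording) might look like the sketch below. The equal-length contiguous split, the use of the population standard deviation, and the function name are assumptions; the abstract does not say how the parts are chosen.

```python
from statistics import pstdev


def segment_std_features(series, n_segments):
    """Split a series into contiguous parts and return the std of each part."""
    n = len(series)
    if n_segments < 1 or n_segments > n:
        raise ValueError("n_segments must be between 1 and len(series)")
    # Integer boundaries keep every segment non-empty when n_segments <= n.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [
        pstdev(series[bounds[i]:bounds[i + 1]]) for i in range(n_segments)
    ]
```

The resulting list has one value per segment, so recordings of different lengths map to feature vectors of the same dimension, which is what a SOM needs as input.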

Insights into therapeutic targets and biomarkers using integrated multi-‘omics’ approaches for dilated and ischemic cardiomyopathies

Integrative Biology ◽

10.1093/intbio/zyab007 ◽

2021 ◽

Author(s):

Austė Kanapeckaitė ◽

Neringa Burokienė

Keyword(s):

Machine Learning ◽

Single Cell ◽

Learning Algorithm ◽

Expression Profiles ◽

Therapeutic Targets ◽

Development Stage ◽

Biological Data ◽

Specific Gene ◽

Tissue Remodelling ◽

Pharmacological Management

At present, heart failure (HF) treatment only targets the symptoms based on the left ventricle dysfunction severity; however, the lack of systemic ‘omics’ studies and available biological data to uncover the heterogeneous underlying mechanisms signifies the need to shift the analytical paradigm towards network-centric and data mining approaches. This study, for the first time, aimed to investigate how bulk and single cell RNA-sequencing as well as proteomics analysis of human heart tissue can be integrated to uncover HF-specific networks and potential therapeutic targets or biomarkers. We also aimed to address the issue of dealing with a limited number of samples and to show how appropriate statistical models, enrichment with other datasets, and machine learning-guided analysis can aid in such cases. Furthermore, we elucidated specific gene expression profiles using transcriptomic data and data mined from public databases. This was achieved using a two-step machine learning algorithm that predicts the likelihood of therapeutic target or biomarker tractability based on a novel scoring system, which is also introduced in this study. The described methodology could be very useful for target or biomarker selection and evaluation during the pre-clinical therapeutics development stage, as well as for disease progression monitoring. In addition, the present study sheds new light on the complex aetiology of HF, differentiating between subtle changes in dilated cardiomyopathies (DCs) and ischemic cardiomyopathies (ICs) at the single cell, proteome and whole transcriptome levels, and demonstrating that HF might depend on the involvement not only of cardiomyocytes but also of other cell populations. The identified tissue remodelling and inflammatory processes can be beneficial when selecting targeted pharmacological management for DCs or ICs, respectively.

NoRCE: non-coding RNA sets cis enrichment tool

BMC Bioinformatics ◽

10.1186/s12859-021-04112-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gulden Olgun ◽

Afshan Nabi ◽

Oznur Tastan

Keyword(s):

Expression Patterns ◽

Target Prediction ◽

Enrichment Analysis ◽

Fruit Fly ◽

Relevant Information ◽

R Package ◽

Data Repository ◽

Biologically Relevant ◽

Gene Sets ◽

Data Files

Background: While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint at a functional association.
Results: We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and expression data for coding & ncRNAs specific to cancer. Additionally, the users can utilize custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast.
Conclusions: NoRCE is a platform-independent, user-friendly, comprehensive R package that can be used to gain insight into the functional importance of a list of ncRNAs of any type. The tool offers flexibility to conduct the users’ preferred set of analyses by designing their own pipeline of analysis. NoRCE is available in Bioconductor and https://github.com/guldenolgun/NoRCE.

An approach for document retrieval using cluster-based inverted indexing

Journal of Information Science ◽

10.1177/01655515211018401 ◽

2021 ◽

pp. 016555152110184

Author(s):

Gunjan Chandwani ◽

Anil Ahlawat ◽

Gaurav Dubey

Keyword(s):

High Performance ◽

Clustering Algorithm ◽

Pearson Correlation ◽

Relevant Information ◽

Document Retrieval ◽

Bhattacharyya Distance ◽

Data Set ◽

Query Matching ◽

Inverted Indexing ◽

Query Optimisation

Document retrieval plays an important role in knowledge management, as it helps us discover relevant information in existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, pre-processing removes unnecessary and redundant words from the documents. The documents are then indexed by the cluster-based inverted indexing algorithm, which integrates the piecewise fuzzy C-means (piFCM) clustering algorithm with inverted indexing. Once the documents are indexed, user queries are matched against them using the Bhattacharyya distance. Finally, query optimisation is performed with the Pearson correlation coefficient, and the relevant documents are retrieved. The performance of the proposed algorithm is evaluated on the WebKB and Twenty Newsgroups data sets. The analysis shows that the proposed algorithm offers high performance, with a precision of 1, a recall of 0.70 and an F-measure of 0.8235. The proposed document retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information.
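
Two of the quantities named in this abstract have standard definitions that can be checked directly. The sketch below computes the F-measure as the harmonic mean of precision and recall, which reproduces the reported 0.8235 from a precision of 1 and a recall of 0.70, and the Bhattacharyya distance between two discrete probability distributions. How the paper turns queries and documents into distributions is not stated, so treating them as normalised vectors of equal length is an assumption, as are the function names.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions.

    p and q are assumed to be non-negative, of equal length, and to sum to 1.
    """
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Distributions with no overlap have coefficient 0 and infinite distance.
    return -math.log(coefficient) if coefficient > 0 else math.inf


print(round(f_measure(1.0, 0.70), 4))  # 0.8235
```

A distance of 0 means the two distributions are identical; larger values mean less overlap, so a query would be matched to the documents with the smallest distance.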

A Scalable Redefined Stochastic Blockmodel

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3442589 ◽

2021 ◽

Vol 15 (3) ◽

pp. 1-28

Author(s):

Xueyan Liu ◽

Bo Yang ◽

Hechang Chen ◽

Katarzyna Musial ◽

Hongxu Chen ◽

...

Keyword(s):

Large Scale ◽

Network Science ◽

Learning Algorithm ◽

State Of The Art ◽

Real World Data ◽

Computational Overhead ◽

Stochastic Blockmodel ◽

Np Hard Problem ◽

Large Scale Networks ◽

The Cost

The stochastic blockmodel (SBM) is a widely used statistical network representation model with good interpretability, expressiveness, generalization, and flexibility, and it has become prevalent and important in the field of network science in recent years. However, learning an optimal SBM for a given network is an NP-hard problem. This significantly limits the application of SBMs to large-scale networks, because of the substantial computational overhead of existing SBMs and of their learning methods. Reducing the cost of SBM learning and making it scalable to large-scale networks, while maintaining the good theoretical properties of the SBM, remains an unresolved problem. In this work, we address this challenging task from a novel perspective of model redefinition. We propose a novel redefined SBM with Poisson distribution and its block-wise learning algorithm that can efficiently analyse large-scale networks. Extensive validation conducted on both artificial and real-world data shows that our proposed method significantly outperforms the state-of-the-art methods in terms of a reasonable trade-off between accuracy and scalability.
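
For orientation, the log-likelihood of a textbook Poisson stochastic blockmodel can be written in a few lines. This is the generic model, in which the edge count between nodes i and j is Poisson with a rate that depends only on their blocks; it is not the redefined SBM or the block-wise learning algorithm proposed in the paper, whose details the abstract does not give. The function name and the undirected, no-self-loop convention are assumptions.

```python
import math


def poisson_sbm_log_likelihood(adjacency, blocks, rates):
    """Log-likelihood of an undirected multigraph under a Poisson SBM.

    adjacency[i][j] is the edge count between nodes i and j (i < j is used),
    blocks[i] is the block index of node i, and rates[a][b] is the Poisson
    rate for a pair of nodes in blocks a and b.
    """
    n = len(adjacency)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            count = adjacency[i][j]
            lam = rates[blocks[i]][blocks[j]]
            # log Poisson pmf: count * log(lam) - lam - log(count!)
            total += count * math.log(lam) - lam - math.lgamma(count + 1)
    return total
```

The double loop visits every node pair, so this direct evaluation costs time quadratic in the number of nodes, which illustrates why scalability is a concern for large networks.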

Sparse Incremental Delta-Bar-Delta for System Identification

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.665.643 ◽

2014 ◽

Vol 665 ◽

pp. 643-646

Author(s):

Ying Liu ◽

Yan Ye ◽

Chun Guang Li

Keyword(s):

System Identification ◽

Cost Function ◽

Learning Algorithm ◽

Learning System ◽

The Other ◽

Sparse System ◽

Speed Up ◽

Sparse System Identification ◽

The Cost ◽

Zero Attractor

A metalearning algorithm learns the base learning algorithm, with the aim of improving the performance of the learning system. The incremental delta-bar-delta (IDBD) algorithm is one such metalearning algorithm. Sparse algorithms, meanwhile, are gaining popularity due to their good performance and wide range of applications. In this paper, we propose a sparse IDBD algorithm that takes the sparsity of the system into account. A norm penalty is added to the cost function of the standard IDBD, which is equivalent to adding a zero attractor to the iterations and can therefore speed up convergence if the system of interest is indeed sparse. Simulations demonstrate that the proposed algorithm is superior to the competing algorithms in sparse system identification.
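
The idea of adding a zero attractor to IDBD can be illustrated with the sketch below. The IDBD part follows the standard per-weight step-size update; the attractor term assumes an l1-norm penalty, which contributes `rho * sign(w)` to each weight update. The abstract does not state which norm the authors use, so that choice, the parameter `rho`, and the function names are assumptions, and the sketch should not be read as the authors' exact algorithm.

```python
import math


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def sparse_idbd_step(w, beta, h, x, y, theta=0.01, rho=0.0):
    """One in-place IDBD update with an assumed l1 zero attractor.

    w: weights, beta: log step sizes, h: IDBD memory traces,
    x: input vector, y: target, theta: meta step size,
    rho: zero-attractor strength (rho = 0 gives plain IDBD).
    """
    delta = y - sum(wi * xi for wi, xi in zip(w, x))
    for i, xi in enumerate(x):
        # Meta-level update of the per-weight log step size.
        beta[i] += theta * delta * xi * h[i]
        alpha = math.exp(beta[i])
        # Base-level update plus a pull of size rho toward zero.
        w[i] += alpha * delta * xi - rho * sign(w[i])
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta
```

With `rho = 0` the function reduces to plain IDBD; a small positive `rho` shrinks every nonzero weight toward zero at each step, which helps only when most true coefficients are zero.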

Development of a New KPI for the Economic Quantification of Six Big Losses and Its Implementation in a Cyber Physical System

Applied Sciences ◽

10.3390/app10249154 ◽

2020 ◽

Vol 10 (24) ◽

pp. 9154

Author(s):

Paula Morella ◽

María Pilar Lambán ◽

Jesús Royo ◽

Juan Carlos Sánchez ◽

Jaime Latapia

Keyword(s):

Real Time ◽

Physical System ◽

Performance Indicator ◽

Cost Model ◽

Relevant Information ◽

Cyber Physical System ◽

Time Data ◽

Real Time Data ◽

Different Dimensions ◽

The Cost

The purpose of this work is to develop a new Key Performance Indicator (KPI) that quantifies the cost of the Six Big Losses developed by Nakajima and to implement it in a Cyber Physical System (CPS), achieving real-time monitoring of the KPI. The paper follows the methodology explained below. A cost model has been used, together with the description of the Six Big Losses, to develop this indicator accurately. At the same time, the machine tool has been integrated into a CPS, enhancing real-time data acquisition through Industry 4.0 technologies. Once the KPI had been defined, we developed software (using Python) that turns these real-time data into relevant information by calculating our indicator. Finally, we carried out a case study showing the results of our new KPI and comparing them to other indicators related to the Six Big Losses but in different dimensions. As a result, our research quantifies the Six Big Losses economically, makes the largest losses easier to detect so that they can be reduced, and highlights the importance of paying attention to different dimensions at the same time, mainly the productive, sustainable, and economic ones.

Identification of CCNB2 expression in triple-negative breast cancer based on bioinformatics results

10.21203/rs.3.rs-506326/v1 ◽

2021 ◽

Author(s):

Jintao Cao ◽

Shuai Sun ◽

Ran Li ◽

Rui Min ◽

Xingyu Fan ◽

...

Keyword(s):

Breast Cancer ◽

Gene Expression ◽

Triple Negative Breast Cancer ◽

Protein Complex ◽

Triple Negative ◽

Expression Profiles ◽

Pathway Enrichment Analysis ◽

Analysis Tool ◽

The Core ◽

Core Genes

Background: Current epidemiology shows that the incidence of breast cancer is increasing year by year and that patients tend to be younger. Triple-negative breast cancer is the most malignant of the breast cancer subtypes. The application of bioinformatics in tumor research is becoming more and more extensive. This study provides research ideas and a basis for exploring potential targets of gene therapy for triple-negative breast cancer (TNBC).
Methods: We analyzed three gene expression profiles (GSE64790, GSE62931, GSE38959) selected from the Gene Expression Omnibus (GEO) database. The GEO2R online analysis tool was used to screen for differentially expressed genes (DEGs) between TNBC and normal tissues. Gene Ontology (GO) function and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were applied to identify the pathways and functional annotation of the DEGs. The protein–protein interaction network of these DEGs was visualized with the Metascape gene-list analysis tool so that we could find the protein complex containing the core genes. Subsequently, we investigated the transcriptional data of the core genes in patients with breast cancer from the Oncomine database. Moreover, the online Kaplan–Meier plotter survival analysis tool was used to evaluate the prognostic value of core gene expression in TNBC patients. Finally, immunohistochemistry (IHC) was used to evaluate the expression level and subcellular localization of CCNB2 in TNBC tissues.
Results: A total of 66 DEGs were identified, including 33 up-regulated genes and 33 down-regulated genes. Among them, a potential protein complex containing five core genes was screened out. High expression of these core genes, especially overexpression of CCNB2, was correlated with poor prognosis in patients with breast cancer. CCNB2 protein was positively expressed in the cytoplasm, and its expression in triple-negative breast cancer tissues was significantly higher than that in adjacent tissues.
Conclusions: CCNB2 may play a crucial role in the development of TNBC and has potential as a prognostic biomarker of TNBC.

A Hybrid Method Based on Semi-Supervised Learning for Relation Extraction in Chinese EMRs (Preprint)

10.2196/preprints.28220 ◽

2021 ◽

Author(s):

ChunMing Yang

Keyword(s):

Supervised Learning ◽

Learning Algorithm ◽

Medical Knowledge ◽

Relation Extraction ◽

Small Scale ◽

Semantic Features ◽

Training Process ◽

Network Layers ◽

Relation Prediction ◽

The Cost

Background: Extracting relations between entities from Chinese electronic medical records (EMRs) is the key to automatically constructing medical knowledge graphs. Because little labeled corpus data is available, most current research is based on shallow networks, which cannot fully capture the complex semantic features in the text of Chinese EMRs.
Objective: In this study, a hybrid deep learning method based on semi-supervised learning is proposed to extract entity relations from small-scale complex Chinese EMRs.
Methods: The semantic features of sentences are extracted by a residual network (ResNet), and long-range dependency information is captured by a bidirectional GRU (Gated Recurrent Unit). An attention mechanism is then used to assign weights to each set of extracted features, and the outputs of the two attention mechanisms are integrated for relation prediction. We adjusted the training process with a manually annotated small-scale relational corpus and a bootstrapping semi-supervised learning algorithm, and continuously expanded the datasets during the training process.
Results: The experimental results show that the best F1-score of the proposed method on the overall relation categories reaches 89.78%, which is 13.07% higher than the baseline CNN model. The F1-scores on the 11 relation categories DAP, SAP, SNAP, TeRD, TeAP, TeCP, TeRS, TeAS, TrAD, TrRD and TrAP reach 80.95%, 93.91%, 92.96%, 88.43%, 86.54%, 85.58%, 87.96%, 94.74%, 93.01%, 87.58% and 95.48%, respectively.
Conclusions: The hybrid neural network method strengthens feature transfer and reuse between different network layers and reduces the cost of manually tagging relations. The results demonstrate that our proposed method is effective for relation extraction in Chinese EMRs.

Quantification of Hepatorenal Index for Computer-Aided Fatty Liver Classification with Self-Organizing Map and Fuzzy Stretching from Ultrasonography

BioMed Research International ◽

10.1155/2015/535894 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Kwang Baek Kim ◽

Chang Won Kim

Keyword(s):

Fatty Liver ◽

Hepatic Steatosis ◽

Learning Algorithm ◽

Fat Content ◽

Liver Fat ◽

Self Organizing Map ◽

Liver Fat Content ◽

Computer Aided ◽

Good Set ◽

Self Organizing

Accurate measures of liver fat content are essential for investigating hepatic steatosis. For a noninvasive, inexpensive ultrasonographic analysis, it is necessary to validate the quantitative assessment of liver fat content so that fully automated, reliable computer-aided software can assist medical practitioners without any operator subjectivity. In this study, we attempt to quantify the hepatorenal index difference between the liver and the kidney across multiple severity levels of hepatic steatosis. To do this, a series of carefully designed image processing techniques, including fuzzy stretching and edge tracking, are applied to extract regions of interest. Then, an unsupervised neural learning algorithm, the self-organizing map, is designed to establish characteristic clusters from the image, and the distribution of the hepatorenal index values across the different levels of fatty liver status is experimentally verified to estimate the differences in the distribution of the hepatorenal index. Such findings will be useful in building reliable computer-aided diagnostic software if combined in the future with a good collection of other characteristic features and powerful machine learning classifiers.
