Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques

Mohamed Loey Ramadan AbdElNabi; Mohammed Wajeeh Jasim; Hazem M. EL-Bakry; Mohamed Hamed N. Taha; Nour Eldeen M. Khalifa

doi:10.3390/sym12030408

Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques

Symmetry ◽

10.3390/sym12030408 ◽

2020 ◽

Vol 12 (3) ◽

pp. 408 ◽

Cited By ~ 3

Author(s):

Mohamed Loey Ramadan AbdElNabi ◽

Mohammed Wajeeh Jasim ◽

Hazem M. EL-Bakry ◽

Mohamed Hamed N. Taha ◽

Nour Eldeen M. Khalifa

Keyword(s):

Gene Expression ◽

Classification Accuracy ◽

Information Gain ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Disease Diagnosis ◽

Performance Measure ◽

Support Vector ◽

Svm Classifier ◽

Cancer Type

Early detection of cancer increases the probability of recovery. This paper presents an intelligent decision support system (IDSS) for the early diagnosis of cancer based on gene expression profiles collected using DNA microarrays. Such datasets pose a challenge because of the small number of samples (no more than a few hundred) relative to the large number of genes (on the order of thousands). Therefore, a method of reducing the number of features (genes) that are not relevant to the disease of interest is necessary to avoid overfitting. The proposed methodology uses the information gain (IG) to select the most important features from the input patterns. Then, the selected features (genes) are reduced by applying the grey wolf optimization (GWO) algorithm. Finally, the methodology employs a support vector machine (SVM) classifier for cancer type classification. The proposed methodology was applied to two datasets (Breast and Colon) and was evaluated based on its classification accuracy, which is the most important performance measure in disease diagnosis. The experimental results indicate that the proposed methodology is able to enhance the stability of the classification accuracy as well as the feature selection.
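
The information gain ranking mentioned in this abstract can be illustrated with a minimal sketch. It assumes expression values have already been discretized into categories; the gene names, the toy data, and the helper names `entropy` and `information_gain` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - sum over v of P(feature = v) * H(labels | feature = v)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: 6 samples, 3 genes discretized to "low"/"high".
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
genes = {
    "gene_a": ["high", "high", "high", "low", "low", "low"],      # separates the classes
    "gene_b": ["high", "low", "high", "low", "high", "low"],      # weakly informative
    "gene_c": ["high", "high", "high", "high", "high", "high"],   # constant, uninformative
}

# Score every gene and rank from most to least informative.
scores = {name: information_gain(values, labels) for name, values in genes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In a pipeline like the one described, only the top-ranked genes would be passed on to the later reduction and classification stages; the cutoff is a design choice the abstract does not state.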

Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques

10.20944/preprints202002.0324.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Mohamed Loey ◽

Mohammed Wajeeh Jasim ◽

Hazem M. EL-Bakry ◽

Mohamed Hamed N. Taha ◽

Nour Eldeen M. Khalifa

Keyword(s):

Gene Expression ◽

Classification Accuracy ◽

Information Gain ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Disease Diagnosis ◽

Performance Measure ◽

Support Vector ◽

Svm Classifier ◽

Cancer Type

Early detection of cancer increases the probability of recovery. This paper presents an intelligent decision support system (IDSS) for the early diagnosis of cancer based on gene expression profiles collected using DNA microarrays. Such datasets pose a challenge because of the small number of samples (no more than a few hundred) relative to the large number of genes (on the order of thousands). Therefore, a method of reducing the number of features (genes) that are not relevant to the disease of interest is necessary to avoid overfitting. The proposed methodology uses the information gain (IG) to select the most important features from the input patterns. Then, the selected features (genes) are reduced by applying the grey wolf optimization (GWO) algorithm. Finally, the methodology employs a support vector machine (SVM) classifier for cancer type classification. The proposed methodology was applied to two datasets (Breast and Colon) and was evaluated based on its classification accuracy, which is the most important performance measure in disease diagnosis. The experimental results indicate that the proposed methodology is able to enhance the stability of the classification accuracy as well as the feature selection.
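
For readers unfamiliar with grey wolf optimization, the following is a rough sketch of the continuous GWO position update as it is commonly described, run on a toy objective. It is not the feature-selection variant used in the paper: the `sphere` objective stands in for a classifier's error, and the function name, population size, bounds, and iteration count are all assumptions chosen for illustration.

```python
import random

def sphere(x):
    """Toy objective to minimize; a stand-in for a classifier's error rate."""
    return sum(v * v for v in x)

def grey_wolf_optimize(objective, dim, n_wolves=20, n_iter=50, lo=-5.0, hi=5.0, seed=0):
    """Minimize `objective` with a basic grey wolf optimizer; returns (position, fitness)."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    best_pos, best_fit = None, float("inf")
    for t in range(n_iter):
        # The three fittest wolves (alpha, beta, delta) lead the pack.
        ranked = sorted(wolves, key=objective)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        if objective(alpha) < best_fit:
            best_pos, best_fit = list(alpha), objective(alpha)
        a = 2.0 * (1 - t / n_iter)  # decreases linearly from 2 toward 0
        new_wolves = []
        for wolf in wolves:
            new_pos = []
            for d in range(dim):
                estimates = []
                for leader in (alpha, beta, delta):
                    A = 2 * a * rng.random() - a   # exploration/exploitation coefficient
                    C = 2 * rng.random()           # random emphasis on the leader
                    dist = abs(C * leader[d] - wolf[d])
                    estimates.append(leader[d] - A * dist)
                value = sum(estimates) / 3         # average of the three leader estimates
                new_pos.append(min(hi, max(lo, value)))  # clamp to the search bounds
            new_wolves.append(new_pos)
        wolves = new_wolves
    # Account for the population produced by the last update.
    final = min(wolves, key=objective)
    if objective(final) < best_fit:
        best_pos, best_fit = list(final), objective(final)
    return best_pos, best_fit

best_pos, best_fit = grey_wolf_optimize(sphere, dim=3)
print(best_pos, best_fit)
```

For gene selection, the positions would typically be mapped to binary include/exclude decisions and the objective replaced by a cross-validated classification error; how the paper does this is not stated in the abstract.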

Tumor Classification Using High-Order Gene Expression Profiles Based on Multilinear ICA

Advances in Bioinformatics ◽

10.1155/2009/926450 ◽

2009 ◽

Vol 2009 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Ming-gang Du ◽

Shan-Wen Zhang ◽

Hong Wang

Keyword(s):

Gene Expression ◽

Expression Profiles ◽

Gene Expression Profiles ◽

High Order ◽

Tumor Classification ◽

Support Vector ◽

Svm Classifier ◽

Tumor Subtypes ◽

Components Analysis ◽

Insight Into

Motivation. Independent Components Analysis (ICA) maximizes the statistical independence of the representational components of a training ensemble of gene expression profiles (GEP), but it cannot distinguish relations between the different factors, or modes, and it is not applicable to high-order GEP data mining. To generalize ICA, we introduce Multilinear-ICA and apply it to tumor classification using high-order GEP. First, we introduce the basic concepts and operations of tensors and describe the Support Vector Machine (SVM) classifier and Multilinear-ICA. Second, the higher-scoring genes of the original high-order GEP are selected using t-statistics and arranged into tensors. Third, the tensors are processed with Multilinear-ICA. Finally, the SVM is used to classify the tumor subtypes. Results. To show the validity of the proposed method, we apply it to tumor classification using high-order GEP. Although we use only three datasets, the experimental results show that the method is effective and feasible. Through this work, we hope to gain some insight into the problem of high-order GEP tumor classification, to aid the development of more effective tumor classification algorithms.
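
The t-statistic gene scoring step can be sketched as follows. The abstract does not say which t-statistic is used, so Welch's unequal-variance form is an assumption here, and the gene names and expression values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's two-sample t-statistic (does not assume equal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical expression values: (class 1 samples, class 2 samples) per gene.
expression = {
    "gene_x": ([5.1, 5.3, 4.9, 5.2], [2.0, 2.2, 1.9, 2.1]),  # strongly differential
    "gene_y": ([3.0, 3.4, 2.8, 3.1], [3.1, 2.9, 3.3, 3.0]),  # similar in both classes
}

# Score each gene, then rank by absolute t so up- and down-regulation both count.
t_scores = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
top = sorted(t_scores, key=lambda g: abs(t_scores[g]), reverse=True)
print(top)
```

Ranking by the absolute value is one common choice; a signed ranking would be used instead if only one direction of change were of interest.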

Screening of characteristic genes in ulcerative colitis by integrating gene expression profiles

BMC Gastroenterology ◽

10.1186/s12876-021-01940-0 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Yingbo Han ◽

Xiumin Liu ◽

Hongmei Dong ◽

Dacheng Wen

Keyword(s):

Gene Expression ◽

Ulcerative Colitis ◽

Expression Profiles ◽

Interaction Network ◽

Disease Diagnosis ◽

Training Dataset ◽

Recursive Feature Elimination ◽

Receptor Interaction ◽

Support Vector ◽

Svm Classifier

Background. This study aimed to screen the feature modules and characteristic genes related to ulcerative colitis (UC) and construct a support vector machine (SVM) classifier to distinguish UC patients. Methods. Four datasets that contained UC and control samples were obtained from the Gene Expression Omnibus database. Differentially expressed genes (DEGs) with consistency were screened via the MetaDE method. The weighted gene coexpression network (WGCNA) was used to distinguish significant modules based on the four datasets. The protein–protein interaction network was established based on intersection genes. Enrichment analysis of Gene Ontology (GO) biological processes (BPs) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment were established based on DAVID. An SVM combined with recursive feature elimination was also applied to construct a disease classifier for the disease diagnosis of UC patients. The efficacy of the SVM classifier was evaluated through receiver operating characteristic curves. Results. Twelve highly preserved modules were obtained using the WGCNA, and 2009 DEGs with significant consistency were selected using the MetaDE method. Sixteen significantly related GO BPs and 12 KEGG pathways were obtained, such as cytokine-cytokine receptor interaction, cell adhesion molecules, and leukocyte transendothelial migration. Subsequently, 41 genes were used to construct an SVM classifier, such as CXCL1, CCR2, IL1B, and IL1A. The area under the curve (AUC) was 0.999 in the training dataset, whereas the AUC was 0.886, 0.790, and 0.819 in the validation set (GSE65114, GSE37283, and GSE36807, respectively). Conclusions. An SVM classifier based on feature genes might correctly identify healthy people or UC patients.
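
The area under the ROC curve reported above can be computed directly from classifier scores. The sketch below uses the rank-based definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one); the labels and scores are made-up illustration values, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = UC patient, 0 = control; scores are classifier outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of this size; a sort-based version would be preferable for large datasets.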

A Robust Gene selection Method for Microarray-based Cancer Classification

Cancer Informatics ◽

10.4137/cin.s3794 ◽

2010 ◽

Vol 9 ◽

pp. CIN.S3794 ◽

Cited By ~ 21

Author(s):

Xiaosheng Wang ◽

Osamu Gotoh

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Selection ◽

Information Gain ◽

Expression Profiles ◽

Feature Selection Method ◽

Gene Expression Profiles ◽

Molecular Classification ◽

Selection Method ◽

Chi Square

Gene selection is of vital importance in molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is extremely crucial. We investigated the properties of one feature selection approach proposed in our previous work, which was the generalization of the feature selection method based on the depended degree of attribute in rough sets. We compared the feature selection method with the established methods: the depended degree, chi-square, information gain, Relief-F and symmetric uncertainty, and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended degree of attribute based method in robustness and applicability. Moreover, the method was comparable to the other four commonly used methods. More importantly, the method can exhibit the inherent classification difficulty with respect to different gene expression datasets, indicating the inherent biology of specific cancers.
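
Of the baseline methods listed in this abstract, the chi-square score is the simplest to show in a few lines. The sketch below assumes discretized expression values and binary class labels; the data and the function name `chi_square` are hypothetical, and this is the textbook Pearson statistic rather than the authors' implementation.

```python
from collections import Counter

def chi_square(feature_values, labels):
    """Pearson chi-square statistic of the feature-by-class contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = f_count * c_count / n  # count expected under independence
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: 8 samples, expression discretized to "low"/"high".
labels = ["tumor"] * 4 + ["normal"] * 4
separating = ["high"] * 4 + ["low"] * 4            # tracks the class exactly
independent = ["high", "high", "low", "low"] * 2   # same split within each class

print(chi_square(separating, labels), chi_square(independent, labels))
```

A larger statistic indicates a stronger association between the gene and the class label, so genes would be ranked in descending order of this score.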

Cancer classification of single-cell gene expression data by neural network

Bioinformatics ◽

10.1093/bioinformatics/btz772 ◽

2019 ◽

Cited By ~ 3

Author(s):

Bong-Hyun Kim ◽

Kijin Yu ◽

Peter C W Lee

Keyword(s):

Neural Network ◽

Gene Expression ◽

Single Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Cancer Classification ◽

Supplementary Information ◽

Support Vector ◽

K Nearest Neighbors ◽

Normal Tissues

Motivation. Cancer classification based on gene expression profiles has provided insight on the causes of cancer and cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq). Results. We designed cancer classifiers that can identify 21 types of cancers and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA based on the 300 most significant genes expressed in each cancer. Then, we compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF) methods. The NN performed consistently better than other methods. We further applied our approach to scRNA-seq transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples. Availability and implementation. Cancer classification by neural network. Supplementary information. Supplementary data are available at Bioinformatics online.

New Short Term Prediction Method for Chemical Carcinogenicity by Hepatic Transcript Profiling following 28-Day Toxicity Tests in Rats

Cancer Informatics ◽

10.4137/cin.s7789 ◽

2011 ◽

Vol 10 ◽

pp. CIN.S7789 ◽

Cited By ~ 4

Author(s):

Hiroshi Matsumoto ◽

Yoshikuni Yakabe ◽

Fumiyo Saito ◽

Koichi Saito ◽

Kayo Sumida ◽

...

Keyword(s):

Gene Expression ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Prediction Method ◽

Predictive Score ◽

Toxicity Tests ◽

Support Vector ◽

Total System ◽

Prediction Formula ◽

Group 1

We have previously shown that the hepatic gene expression profiles of carcinogens in 28-day toxicity tests clustered into three major groups (Group-1 to 3). Here, we developed a new prediction method for Group-1 carcinogens, which consist mainly of genotoxic rat hepatocarcinogens. The prediction formula was generated by a support vector machine using 5 selected genes as the predictive genes, and a predictive score was introduced to judge carcinogenicity. It correctly predicted the carcinogenicity of all 17 Group-1 chemicals and 22 of 24 non-carcinogens regardless of genotoxicity. In the dose-response study, the prediction score changed from negative to positive as the dose increased, indicating that the characteristic gene expression profile emerged over a range of carcinogen-specific doses. We conclude that the prediction formula can quantitatively predict the carcinogenicity of Group-1 carcinogens. The same method may be applied to other groups of carcinogens to build a total system for prediction of carcinogenicity.

Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis

European Journal of Human Genetics ◽

10.1038/sj.ejhg.5201495 ◽

2005 ◽

Vol 13 (12) ◽

pp. 1303-1311 ◽

Cited By ~ 43

Author(s):

Xue Wu Zhang ◽

Yee Leng Yap ◽

Dong Wei ◽

Feng Chen ◽

Antoine Danchin

Keyword(s):

Gene Expression ◽

Independent Component Analysis ◽

Molecular Diagnosis ◽

Human Cancer ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Component Analysis ◽

Independent Component ◽

Cancer Type

Prediction of Breast Cancer Metastasis by Gene Expression Profiles: A Comparison of Metagenes and Single Genes

Cancer Informatics ◽

10.4137/cin.s10375 ◽

2012 ◽

Vol 11 ◽

pp. CIN.S10375 ◽

Cited By ~ 3

Author(s):

Mark Burton ◽

Mads Thomassen ◽

Qihua Tan ◽

Torben A. Kruse

Keyword(s):

Gene Expression ◽

Cross Validation ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Microarray Platform ◽

Support Vector ◽

Independent Data ◽

Performance Difference ◽

Feature Sets ◽

Prediction Of Metastasis

Background. The popularity of microarray applications in cancer research has led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made full validation of such profiles and their related gene lists across studies difficult, and classification accuracies have rarely been validated in multiple independent datasets. Frequently, while the individual genes between such lists may not match, genes with the same function are included across such gene lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. It is, therefore, necessary to systematically compare independent validation of gene lists or classifiers based on metagene or individual gene (SG) features. Methods. In this study we compared the performance of either metagene- or single gene-based feature sets and classifiers using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach. Results. MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets. The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms. Conclusion. Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
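
The "10 times repeated 10-fold cross validation" scheme named in the Methods can be sketched as an index generator. This is a plain, unstratified version written for illustration; the function name, the seed, and the sample count are assumptions, and the study may well have used stratified folds.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for each repeat
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test

# Hypothetical dataset of 50 samples: 10 repeats x 10 folds = 100 train/test splits.
splits = list(repeated_kfold(n_samples=50))
print(len(splits))
```

A classifier would be fitted on each `train` set and scored on the matching `test` set, and the resulting scores averaged over all splits to reduce the variance that a single partition would introduce.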

Breast Cancer Case Identification Based on Deep Learning and Bioinformatics Analysis

Frontiers in Genetics ◽

10.3389/fgene.2021.628136 ◽

2021 ◽

Vol 12 ◽

Author(s):

Dongfang Jia ◽

Cheng Chen ◽

Chen Chen ◽

Fangfang Chen ◽

Ningrui Zhang ◽

...

Keyword(s):

Breast Cancer ◽

Neural Network ◽

Gene Expression ◽

Expression Profiles ◽

Differential Expression Analysis ◽

Gene Expression Profiles ◽

Diagnostic Methods ◽

The Cancer Genome Atlas ◽

Support Vector ◽

Hub Genes

Mastering the molecular mechanism of breast cancer (BC) can provide an in-depth understanding of BC pathology. This study explored existing technologies for diagnosing BC, such as mammography, ultrasound, magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET), and summarized the disadvantages of existing cancer diagnosis methods. The purpose of this article is to use gene expression profiles of The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) to classify BC samples and normal samples. The method proposed in this article overcomes some of the shortcomings of traditional diagnostic methods and can conduct BC diagnosis more rapidly, with high sensitivity and no radiation. This study first selected the genes most relevant to cancer through weighted gene co-expression network analysis (WGCNA) and differential expression analysis (DEA). Then it used the protein–protein interaction (PPI) network to screen 23 hub genes. Finally, it used the support vector machine (SVM), decision tree (DT), Bayesian network (BN), artificial neural network (ANN), and the convolutional neural networks CNN-LeNet and CNN-AlexNet to process the expression levels of the 23 hub genes. For gene expression profiles, the ANN model has the best performance in the classification of cancer samples. The ten-time average accuracy is 97.36% (±0.34%), the F1 value is 0.8535 (±0.0260), the sensitivity is 98.32% (±0.32%), the specificity is 89.59% (±3.53%) and the AUC is 0.99. In summary, this method effectively classifies cancer samples and normal samples and provides reasonable new ideas for the early diagnosis of cancer in the future.
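
The accuracy, sensitivity, specificity, and F1 figures quoted above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions on a made-up set of predictions; the function name and the example labels are hypothetical and unrelated to the study's data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

# Hypothetical predictions: 1 = cancer sample, 0 = normal sample.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Which class is treated as positive changes sensitivity, specificity, and F1, so that choice should be stated alongside any reported values.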

SVM-RFE Based Feature Selection and Taguchi Parameters Optimization for Multiclass SVM Classifier

The Scientific World JOURNAL ◽

10.1155/2014/795624 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 30

Author(s):

Mei-Ling Huang ◽

Yung-Hsiang Hung ◽

W. M. Lee ◽

R. K. Li ◽

Bo-Ru Jiang

Keyword(s):

Feature Selection ◽

Classification Accuracy ◽

Explanatory Power ◽

Disease Diagnosis ◽

Parameters Optimization ◽

Recursive Feature Elimination ◽

Support Vector ◽

Svm Classifier ◽

Classification Problems ◽

Class Variable

Recently, the support vector machine (SVM) has shown excellent performance in classification and prediction and has been widely used in disease diagnosis and medical assistance. However, SVM only functions well on two-group classification problems. This study combines feature selection and SVM recursive feature elimination (SVM-RFE) to investigate the classification accuracy of multiclass problems for the Dermatology and Zoo databases. The Dermatology dataset contains 33 feature variables, 1 class variable, and 366 testing instances; the Zoo dataset contains 16 feature variables, 1 class variable, and 101 testing instances. The feature variables in the two datasets were sorted in descending order by explanatory power, and different feature sets were selected by SVM-RFE to explore classification accuracy. Meanwhile, the Taguchi method was combined with the SVM classifier to optimize the parameters C and γ and increase classification accuracy for multiclass classification. The experimental results show that the classification accuracy can be more than 95% after SVM-RFE feature selection and Taguchi parameter optimization for the Dermatology and Zoo databases.
