Data Balancing Based on Pre-Training Strategy for Liver Segmentation from CT Scans

2019 ◽  
Vol 9 (9) ◽  
pp. 1825 ◽  
Author(s):  
Yong Zhang ◽  
Yi Wang ◽  
Yizhu Wang ◽  
Bin Fang ◽  
Wei Yu ◽  
...  

Data imbalance is often encountered in deep learning and is harmful to model training. An imbalance between hard and easy samples often occurs in training datasets for segmentation tasks on Computed Tomography (CT) scans. However, because adjacent slices in a volume are strongly similar and sample difficulty depends on the task (the same slice may be a hard sample for liver segmentation but an easy sample for kidney or spleen segmentation), this imbalance is difficult to resolve with traditional methods. In this work, we use a pre-training strategy to distinguish hard and easy samples and then increase the proportion of hard slices in the training dataset, which can mitigate the imbalance between hard and easy samples and enhance the contribution of hard samples to training. Our experiments on liver, kidney and spleen segmentation show that increasing the ratio of hard samples in the training dataset can enhance the model's predictive ability by improving its handling of hard samples. The main contribution of this work is the application of a pre-training strategy, which enables us to select training samples online according to the task and to ease data imbalance in the training dataset.
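The selection step described in this abstract can be illustrated with a minimal sketch. It assumes that a pre-trained model has already produced a binary mask per slice, that a slice counts as "hard" when its Dice score falls below a hypothetical threshold, and that hard slices are simply repeated in the training index. The abstract specifies none of these details, so the function names, the threshold, and the repeat factor are illustrative only.

```python
import random

def dice(pred, truth):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    inter = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total

def split_hard_easy(preds, truths, threshold=0.9):
    """Split slice indices into hard and easy (threshold is an assumed value)."""
    hard, easy = [], []
    for i, (p, t) in enumerate(zip(preds, truths)):
        (hard if dice(p, t) < threshold else easy).append(i)
    return hard, easy

def rebalance(hard, easy, hard_repeat=3, seed=0):
    """Build a shuffled training index in which each hard slice appears hard_repeat times."""
    indices = easy + hard * hard_repeat
    random.Random(seed).shuffle(indices)
    return indices
```

Because difficulty is measured from the pre-trained model's own predictions, the same slice can land in `hard` for one organ and in `easy` for another, which matches the task dependence the abstract describes.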

Author(s):  
L. Hang ◽  
G. Y. Cai

Abstract. The detection and reconstruction of buildings have attracted increasing attention in the remote sensing and computer vision communities. Light detection and ranging (LiDAR) has proved to be a good way to extract building roofs, but LiDAR data are in short supply most of the time. In this paper, we extract building roofs from very high resolution (VHR) images of the Chinese satellite Gaofen-2 using a convolutional neural network (CNN). CNNs have been shown to be more capable of recognizing detailed features that an object-based classification approach may fail to classify. The study involves several major steps: generation of the training dataset, model training, image segmentation, and building roof recognition. First, urban objects such as trees, roads, squares and buildings were classified with a random forest algorithm in an object-oriented classification approach, and the building regions were separated from the other classes with the aid of visual interpretation and correction. Next, different types of building roofs, categorized mainly by color and size, were trained with the CNN. Finally, industrial and residential building roofs were recognized and validated separately. The assessment results demonstrate the effectiveness of the proposed method, with quality rates of approximately 91% and 88% in detecting industrial and residential building roofs, respectively. This indicates that the CNN approach is promising for detecting buildings with high accuracy.


Author(s):  
Ernesto Escobedo ◽  
Liliana Arguello ◽  
Marzia Sepe ◽  
Ilaria Parrella ◽  
Stefano Cioncolini ◽  
...  

Abstract The monitoring and diagnostics of industrial systems is growing in complexity as larger volumes of data are collected and as many methods and analytics become able to correlate data and events. The setup and training of these methods and analytics are among the factors that most affect the choice of the most appropriate solution for an efficient and effective service, since training the models requires selecting the most suitable dataset, which in turn demands time and knowledge. This paper describes a methodology for tracking features, detecting outliers and deriving, in a probabilistic way, diagnostic thresholds applied by means of hierarchical models that simplify or remove the need for a subject matter expert to select the proper training dataset at each deployment. The method applies to industrial systems employing a large number of similar machines connected to a remote data center, with the purpose of alerting one or more operators when a feature exceeds the healthy distribution. Some relevant use cases are presented for an aeroderivative gas turbine, including its auxiliary equipment, with a deep dive into the hydraulic starting system. The results, in terms of early anomaly detection and reduced model training effort, are compared with traditional monitoring approaches such as fixed thresholds. Moreover, this study explains the advantages of this probabilistic approach in a business application such as advanced fleet monitoring and diagnostic services.
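As a rough illustration of a probabilistically derived threshold, the sketch below pools readings of one feature from healthy machines, fits a single Gaussian, and places the alert threshold at an upper-tail quantile. This is a deliberate simplification and not the hierarchical model the abstract refers to. The function names and the tail probability are assumptions made for the example.

```python
from statistics import NormalDist

def fit_threshold(healthy_values, tail_prob=0.001):
    """Upper alert threshold from a Gaussian fitted to pooled healthy readings.

    tail_prob is an assumed false-alarm probability per reading.
    """
    dist = NormalDist.from_samples(healthy_values)
    return dist.inv_cdf(1.0 - tail_prob)

def alerts(readings, threshold):
    """Indices of readings that exceed the healthy-distribution threshold."""
    return [i for i, value in enumerate(readings) if value > threshold]
```

Unlike a fixed threshold chosen by hand, the limit here moves with the spread of the healthy data, so a feature that is naturally noisy across the fleet receives a wider band.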


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 133 ◽  
Author(s):  
Yang Li ◽  
Ying Lv ◽  
Suge Wang ◽  
Jiye Liang ◽  
Juanzi Li ◽  
...  

A large-scale, high-quality training dataset is essential for learning an ideal classifier for text sentiment classification. However, manually constructing such a training dataset with sentiment labels is a labor-intensive and time-consuming task. Therefore, based on the idea of effectively utilizing unlabeled samples, this paper proposes a comprehensive framework for text sentiment classification that covers the whole process of semi-supervised learning, from seed selection and iterative modification of the training text set to the co-training strategy of the classifier. To provide a basis for selecting the seed texts and modifying the training text set, three kinds of measures are defined: the cluster similarity degree of an unlabeled text, the cluster uncertainty degree of a pseudo-label text to a learner, and the reliability degree of a pseudo-label text to a learner. With these measures, a seed selection method based on Random Swap clustering, a hybrid modification method of the training text set based on active learning and self-learning, and an alternating co-training strategy of the ensemble classifier of Maximum Entropy and Support Vector Machine are proposed and combined into our framework. Experimental results on three real-world Chinese datasets (COAE2014, COAE2015, and a Hotel review) and five real-world English datasets (Books, DVD, Electronics, Kitchen, and MR) verify the effectiveness of the proposed framework.


2009 ◽  
Vol 18 (06) ◽  
pp. 853-881 ◽  
Author(s):  
TODOR GANCHEV

In the present contribution we propose an integral training procedure for the Locally Recurrent Probabilistic Neural Networks (LR PNNs). Specifically, the adjustment of the smoothing factor "sigma" in the pattern layer of the LR PNN and the training of the recurrent layer weights are integrated in an automatic process that iteratively estimates all adjustable parameters of the LR PNN from the available training data. Furthermore, in contrast to the original LR PNN, whose recurrent layer was trained to provide optimum separation among the classes on the training dataset while striving to keep a balance between the learning rates for all classes, here the training strategy is oriented towards directly optimizing the overall classification accuracy. More precisely, the new training strategy directly targets maximizing the posterior probabilities for the target class and minimizing the posterior probabilities estimated for the non-target classes. The new fitness function requires fewer computations for each evaluation, and therefore the overall computational demands for training the recurrent layer weights are reduced. The performance of the integrated training procedure is illustrated on three different speech processing tasks: emotion recognition, speaker identification and speaker verification.
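To make the role of the smoothing factor concrete, the following sketch implements only the pattern and summation layers of a plain probabilistic neural network with Gaussian kernels and equal class priors. It does not include the recurrent layer or the integrated training procedure of the LR PNN. The function name and data layout are assumptions made for illustration.

```python
import math

def pnn_posteriors(x, train, sigma):
    """Class posteriors from a plain PNN (Gaussian kernels, equal priors assumed).

    train maps each class label to a list of training feature vectors.
    sigma is the smoothing factor shared by all pattern units.
    """
    scores = {}
    for label, patterns in train.items():
        total = 0.0
        for p in patterns:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Average kernel response over the class's pattern units.
        scores[label] = total / len(patterns)
    norm = sum(scores.values())
    # Normalize so the class scores sum to one.
    return {label: s / norm for label, s in scores.items()}
```

A small `sigma` makes each pattern unit respond only to very close inputs, while a large `sigma` blurs the classes together, which is why its value has to be tuned on the training data.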


2014 ◽  
Author(s):  
Ahmed Draoua ◽  
Adélaïde Albouy-Kissi ◽  
Antoine Vacavant ◽  
Vincent Sauvage

2021 ◽  
Vol 33 (5) ◽  
pp. 83-104
Author(s):  
Aleksandr Igorevich Getman ◽  
Maxim Nikolaevich Goryunov ◽  
Andrey Georgievich Matskevich ◽  
Dmitry Aleksandrovich Rybolovlev

The paper discusses the training of machine-learning models for detecting computer attacks. It presents, in turn, an analysis of publicly available training datasets and of tools for analyzing network traffic and extracting features of network sessions. The drawbacks of existing tools and possible errors in the datasets generated with them are noted. The paper concludes that one must collect one's own training data, because the reliability of public datasets is not guaranteed and pre-trained models are of limited use in networks whose characteristics differ from those of the network in which the training traffic was collected. A practical approach to generating training data for computer attack detection models is proposed. The proposed solutions have been tested to evaluate the quality of model training on the collected data and the quality of attack detection in a real network infrastructure.


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Pan Ruchong ◽  
Tang Haiping ◽  
Wang Xiang

Background. Differentiated thyroid cancer (DTC) is the most common type of thyroid tumor and has a high recurrence rate. Here, we developed a nomogram to effectively predict postoperative disease-free survival (DFS) in DTC patients. Methods. The mRNA expression and clinical data of DTC patients were downloaded from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) database. Seventy percent of patients were randomly selected as the training dataset, and the remaining thirty percent were assigned to the testing dataset. Multivariate Cox regression analysis was used to establish a nomogram to predict the 1-year, 3-year, and 5-year DFS rates of DTC patients. Results. A five-gene signature comprising the TENM1, FN1, APOD, F12, and BTNL8 genes was established to predict the DFS rate of DTC patients. Results from the concordance index (C-index), area under curve (AUC), and calibration curve showed that the model exhibited good prediction ability on both the training dataset and the testing dataset and was superior to other traditional models. The risk score of the five-gene signature and distant metastasis (M) were independent risk factors for DTC recurrence. A nomogram that could predict the 1-year, 3-year, and 5-year DFS rates of DTC patients was established with a C-index of 0.801 (95% CI: 0.736, 0.866). Conclusion. Our study developed a prediction model based on gene expression and clinical characteristics to predict the DFS rate of DTC patients, which may be applied to assess patient prognosis more accurately and to guide individualized treatment.
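The concordance index reported in this abstract can be illustrated with a minimal sketch of Harrell's C-index for right-censored data. It assumes that a higher risk score should correspond to earlier recurrence, skips pairs with tied times, and counts tied risk scores as half concordant. The function name and input layout are illustrative and are not taken from the study.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  follow-up time per patient
    events: 1 if recurrence was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected recurrence)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the patient with the shorter
            # follow-up time actually had an observed event.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a C-index of 0.801 means that in roughly four out of five comparable patient pairs the model ranks the earlier recurrence as higher risk.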


2021 ◽  
Author(s):  
Sayedali Shetab Boushehri ◽  
Ahmad Qasim ◽  
Dominik Waibel ◽  
Fabian Schmich ◽  
Carsten Marr

Abstract Deep learning based classification of biomedical images requires manual annotation by experts, which is time-consuming and expensive. Incomplete-supervision approaches including active learning, pre-training and semi-supervised learning address this issue and aim to increase classification performance with a limited number of annotated images. Up to now, these approaches have been mostly benchmarked on natural image datasets, where image complexity and class balance typically differ considerably from biomedical classification tasks. In addition, it is not clear how to combine them to improve classification performance on biomedical image data. We thus performed an extensive grid search combining seven active learning algorithms, three pre-training methods and two training strategies as well as respective baselines (random sampling, random initialization, and supervised learning). For four biomedical datasets, we started training with 1% of labeled data and increased it by 5% iteratively, using 4-fold cross-validation in each cycle. We found that the contribution of pre-training and semi-supervised learning can reach up to 20% macro F1-score in each cycle. In contrast, the state-of-the-art active learning algorithms contribute less than 5% to macro F1-score in each cycle. Based on performance, implementation ease and computation requirements, we recommend the combination of BADGE active learning, ImageNet-weights pre-training, and pseudo-labeling as training strategy, which reached over 90% of fully supervised results with only 25% of annotated data for three out of four datasets. We believe that our study is an important step towards annotation and resource efficient model training for biomedical classification challenges.
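Because the comparisons in this abstract are expressed in macro F1-score, a short sketch of that metric may help. It computes the unweighted mean of per-class F1 over every label that appears in either the ground truth or the predictions. The function name is an assumption, and a class with no true or predicted samples is given an F1 of zero by convention here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), with 0 when the class never occurs.
        f1_scores.append(0.0 if denom == 0 else 2.0 * tp / denom)
    return sum(f1_scores) / len(f1_scores)
```

Since each class contributes equally regardless of its size, macro F1 penalizes poor performance on rare classes, which is relevant for the class imbalance typical of biomedical datasets.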

