Revisiting Label Smoothing Regularization with Knowledge Distillation

2021 ◽  
Vol 11 (10) ◽  
pp. 4699
Author(s):  
Jiyue Wang ◽  
Pei Zhang ◽  
Qianhua He ◽  
Yanxiong Li ◽  
Yongjian Hu

Label Smoothing Regularization (LSR) is a widely used tool for generalizing classification models by replacing the one-hot ground truth with smoothed labels. Recent research on LSR has increasingly focused on its correlation with Knowledge Distillation (KD), which transfers knowledge from a teacher model to a lightweight student model by penalizing the Kullback–Leibler divergence between their outputs. Based on this observation, a Teacher-free Knowledge Distillation (Tf-KD) method was proposed in previous work: instead of a real teacher model, a handcrafted distribution similar to LSR was used to guide student learning. Tf-KD is a promising substitute for LSR except for its hard-to-tune and model-dependent hyperparameters. This paper develops a new teacher-free framework, LSR-OS-TC, which decomposes the Tf-KD method into two components: model Output Smoothing (OS) and Teacher Correction (TC). First, LSR-OS extends the LSR method to the KD regime and applies a softening temperature to the model's output softmax layer; output smoothing is critical for stabilizing the KD hyperparameters across different models. Second, in the TC part, a larger share of probability is assigned to the correct class of the uniform-distribution teacher to provide a more informative teacher. The two-component method was evaluated exhaustively on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (GTZAN) classification tasks. The results showed that LSR-OS can improve LSR performance independently with no extra computational cost, especially on several deep neural networks where LSR is ineffective. The further training boost from the TC component showed the effectiveness of our two-component strategy. Overall, LSR-OS-TC is a practical substitute for LSR: unlike the original Tf-KD method, it can be tuned on one model and applied directly to other models.
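As a rough illustration of the two ingredients described above, the following numpy sketch builds an LSR-style smoothed label and then applies a TC-style correction that shifts extra probability mass onto the correct class. The function names and the `boost` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def smooth_labels(y, n_classes, eps=0.1):
    """Standard LSR: mix the one-hot target with a uniform distribution."""
    onehot = np.eye(n_classes)[y]
    return (1.0 - eps) * onehot + eps / n_classes

def corrected_teacher(y, n_classes, eps=0.1, boost=0.5):
    """TC-style teacher (illustrative): shift an extra share of
    probability mass onto the correct class of a smoothed label."""
    t = (1.0 - boost) * smooth_labels(y, n_classes, eps)
    t[np.arange(len(y)), y] += boost
    return t
```

Both constructions remain valid probability distributions (rows sum to one); the corrected teacher simply concentrates more mass on the ground-truth class than plain LSR does.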

2019 ◽  
Vol 2019 ◽  
pp. 1-10
Author(s):  
Diehao Kong ◽  
Xuefeng Yan

Autoencoders are used for fault diagnosis in chemical engineering. To improve their performance, experts have paid close attention to regularization strategies and the creation of new and effective cost functions. However, existing methods modify only a single model. This study provides a new perspective for strengthening the fault diagnosis model: it attempts to gain useful information from one model (the teacher model) and apply it to a new model (the student model). It pretrains the teacher model by fitting ground-truth labels and then uses a sample-wise strategy to transfer knowledge from the teacher model. Finally, the knowledge and the ground-truth labels are used to train the student model, which is identical to the teacher model in structure. The current student model is then used as the teacher of the next student model. After step-by-step teacher–student reconfiguration and training, the optimal model is selected for fault diagnosis. Knowledge distillation is applied throughout the training procedure. The proposed method is applied to several benchmark problems to prove its effectiveness.
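The step-by-step teacher–student scheme can be sketched as follows. The blending coefficient `alpha` and the stand-in "training" step (where the student simply adopts its targets) are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def blended_targets(teacher_probs, onehot, alpha=0.5):
    """Per-sample mix of teacher knowledge and ground-truth labels."""
    return alpha * teacher_probs + (1.0 - alpha) * onehot

rng = np.random.default_rng(0)
onehot = np.eye(3)[rng.integers(0, 3, size=8)]
teacher = np.full((8, 3), 1.0 / 3.0)  # generation 0: uninformative teacher

for generation in range(3):
    targets = blended_targets(teacher, onehot)
    # stand-in for training: the "student" simply fits its targets,
    # then serves as the teacher of the next generation
    teacher = targets
```

Even in this toy loop, successive generations concentrate probability on the correct class, mirroring how each trained student passes refined knowledge to its successor.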


Author(s):  
Taehyeon Kim ◽  
Jaehoon Oh ◽  
Nak Yil Kim ◽  
Sangwook Cho ◽  
Se-Young Yun

Knowledge distillation (KD), transferring knowledge from a cumbersome teacher model to a lightweight student model, has been investigated for designing efficient neural architectures. Generally, the objective function of KD is the Kullback–Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, with temperature-scaling hyperparameter τ. Despite its widespread use, few studies have discussed how such softening influences generalization. Here, we theoretically show that the KL divergence loss focuses on logit matching as τ increases and on label matching as τ goes to 0, and empirically show that logit matching is, in general, positively correlated with performance improvement. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logits of the teacher model. The MSE loss outperforms the KL divergence loss, which we attribute to the difference in penultimate-layer representations induced by the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small τ, mitigates label noise. The code to reproduce the experiments is publicly available online at https://github.com/jhoon-oh/kd_data/.
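A minimal numpy sketch of the two objectives compared above: the τ²-scaled KL divergence between softened distributions and the MSE between raw logits. This is a common formulation of the two losses; details such as the τ² factor may differ from the paper's exact setup.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_kl(student_logits, teacher_logits, tau):
    """tau**2-scaled KL divergence between temperature-softened outputs."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return tau**2 * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

def kd_mse(student_logits, teacher_logits):
    """Direct logit matching with mean squared error."""
    return np.mean((student_logits - teacher_logits) ** 2)
```

With identical logits both losses vanish; as τ grows, the KL term increasingly behaves like a (shifted) logit-matching penalty, which is the regime the abstract connects to better generalization.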


2020 ◽  
Author(s):  
Jingbai Li ◽  
Patrick Reiser ◽  
André Eberhard ◽  
Pascal Friederich ◽  
Steven Lopez

Photochemical reactions are increasingly used to construct complex molecular architectures under mild and straightforward reaction conditions. Computational techniques are increasingly important for understanding the reactivities and chemoselectivities of photochemical isomerization reactions because they offer molecular bonding information along the excited-state pathways of photodynamics simulations. These simulations are resource-intensive and are typically limited to 1–10 picoseconds and 1,000 trajectories due to their high computational cost. Most organic photochemical reactions have excited-state lifetimes exceeding 1 picosecond, which places them beyond the reach of such studies. Westermayr et al. demonstrated that a machine learning approach could significantly lengthen photodynamics simulation times for a model system, the methylenimmonium cation (CH2NH2+).

We have developed a Python-based code, Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics (PyRAI2MD), to accomplish the unprecedented 10 ns cis-trans photodynamics of trans-hexafluoro-2-butene (CF3–CH=CH–CF3) in 3.5 days. The same simulation would take approximately 58 years with ground-truth multiconfigurational dynamics. We proposed an innovative scheme combining Wigner sampling, geometrical interpolations, and short-time quantum chemical trajectories to sample the initial data effectively, facilitating adaptive sampling to generate an informative and data-efficient training set of 6,232 data points. Our neural networks achieved chemical accuracy (mean absolute error of 0.032 eV). Our 4,814 trajectories reproduced the S1 half-life (60.5 fs) and the photochemical product ratio (trans:cis = 2.3:1), and autonomously discovered a pathway towards a carbene. The neural networks have also shown the capability of generalizing the full potential energy surface from chemically incomplete data (trans → cis but not cis → trans pathways), which may enable future automated photochemical reaction discoveries.


Sensors ◽  
2021 ◽  
Vol 21 (15) ◽  
pp. 5076
Author(s):  
Javier Martinez-Roman ◽  
Ruben Puche-Panadero ◽  
Angel Sapena-Bano ◽  
Carla Terron-Santiago ◽  
Jordi Burriel-Valencia ◽  
...  

Induction machines (IMs) are one of the main sources of mechanical power in many industrial processes, especially squirrel cage IMs (SCIMs), due to their robustness and reliability. Their sudden stoppage due to undetected faults may cause costly production breakdowns. One of the most frequent types of fault is a cage fault (bar and end-ring segment breakages), especially in motors that directly drive high-inertia loads (such as fans), in motors with frequent starts and stops, and in the case of poorly manufactured cage windings. Continuous monitoring of IMs, integrated into plant-wide condition-based maintenance (CBM) systems, is needed to reduce this risk. Diverse diagnostic techniques have been proposed in the technical literature, either data-based, detecting fault-characteristic perturbations in the data collected from the IM, or model-based, observing the differences between the data collected from the actual IM and from its digital twin model. In both cases, fast and accurate IM models are needed to develop and optimize the fault diagnosis techniques. On the one hand, the finite elements approach can provide highly accurate models, but its computational cost and processing requirements are too high for use in on-line fault diagnosis systems. On the other hand, analytical models can be much faster, but they can be very complex in the case of highly asymmetrical machines, such as IMs with multiple cage faults. In this work, a new method is proposed for the analytical modelling of IMs with asymmetrical cage windings using a tensor-based approach, which greatly reduces this complexity by applying routine tensor algebra to obtain the parameters of the faulty IM model from the healthy one. This winding tensor approach is explained theoretically and validated through the diagnosis of a commercial IM with multiple cage faults.


2021 ◽  
Vol 7 (6) ◽  
pp. 99
Author(s):  
Daniela di Serafino ◽  
Germana Landi ◽  
Marco Viola

We are interested in the restoration of noisy and blurry images where the texture mainly follows a single direction (i.e., directional images). Problems of this type arise, for example, in microscopy or computed tomography for carbon or glass fibres. In order to deal with these problems, the Directional Total Generalized Variation (DTGV) was developed by Kongskov et al. in 2017 and 2019, in the case of impulse and Gaussian noise. In this article we focus on images corrupted by Poisson noise, extending the DTGV regularization to image restoration models where the data fitting term is the generalized Kullback–Leibler divergence. We also propose a technique for the identification of the main texture direction, which improves upon the techniques used in the aforementioned work about DTGV. We solve the problem by an ADMM algorithm with proven convergence and subproblems that can be solved exactly at a low computational cost. Numerical results on both phantom and real images demonstrate the effectiveness of our approach.
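The generalized Kullback–Leibler data-fitting term used for Poisson noise can be written compactly. This numpy sketch follows the standard definition KL(b, Ax) = Σ(Ax − b + b log(b/Ax)) and illustrates only the data term, not the DTGV regularizer or the ADMM solver.

```python
import numpy as np

def generalized_kl(b, Ax, eps=1e-12):
    """Generalized KL divergence KL(b, Ax) = sum(Ax - b + b*log(b/Ax)),
    the data-fitting term for Poisson noise (0*log(0) taken as 0)."""
    b = np.asarray(b, dtype=float)
    Ax = np.maximum(np.asarray(Ax, dtype=float), eps)
    terms = Ax - b + np.where(b > 0, b * np.log(np.maximum(b, eps) / Ax), 0.0)
    return terms.sum()
```

The divergence is zero exactly when the blurred estimate Ax matches the observed data b, and positive otherwise, which is what makes it a suitable fidelity term under Poisson statistics.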


2018 ◽  
Vol 71 (4) ◽  
pp. 238 ◽  
Author(s):  
Manoj K. Kesharwani ◽  
Amir Karton ◽  
Nitai Sylvetsky ◽  
Jan M. L. Martin

The S66 benchmark for non-covalent interactions has been re-evaluated using explicitly correlated methods with basis sets near the one-particle basis set limit. It is found that post-MP2 ‘high-level corrections’ are adequately treated using a combination of CCSD(F12*) with (aug-)cc-pVTZ-F12 basis sets on the one hand, and (T) extrapolated from conventional CCSD(T)/heavy-aug-cc-pV{D,T}Z on the other hand. Implications for earlier benchmarks on the larger S66×8 problem set in particular, and for accurate calculations on non-covalent interactions in general, are discussed. At a slight cost in accuracy, (T) can be considerably accelerated by using sano-V{D,T}Z+ basis sets, whereas half-counterpoise CCSD(F12*)(T)/cc-pVDZ-F12 offers the best compromise between accuracy and computational cost.
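For readers unfamiliar with {D,T}Z extrapolation, the following sketch shows the standard two-point X⁻³ (Helgaker-style) formula for extrapolating correlation energies from cardinal numbers X = 2, 3; the exact extrapolation parameters used in the paper may differ.

```python
def extrapolate_cbs(e_x, e_y, x=2, y=3):
    """Two-point X**-3 extrapolation of correlation energies to the
    complete-basis-set limit from cardinal numbers x < y
    (e.g. x=2, y=3 for a {D,T}Z pair). Energies in hartree."""
    return (y**3 * e_y - x**3 * e_x) / (y**3 - x**3)
```

Because correlation energy converges from above as the basis grows, the extrapolated value lies slightly beyond the larger-basis result in the direction of convergence.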


1997 ◽  
Vol 236 (5-6) ◽  
pp. 445-454 ◽  
Author(s):  
A.G. Izergin ◽  
A.G. Pronko

2021 ◽  
Author(s):  
Shikha Suman ◽  
Ashutosh Karna ◽  
Karina Gibert

Hierarchical clustering is one of the most popular approaches to understanding the underlying structure of a dataset and defining typologies, with multiple applications in real life. Unlike popular methods such as k-means, it reveals the inner structure of the dataset and yields the number of clusters as an output, and the granularity of the final clustering can be adjusted to the goals of the analysis. The number of clusters in a hierarchical method relies on the analysis of the resulting dendrogram: experts have criteria to visually inspect the dendrogram and determine the number of clusters, but finding automatic criteria that imitate experts in this task is still an open problem. Dependence on the expert to cut the tree limits real applications in fields such as Industry 4.0 and additive manufacturing. This paper analyses several cluster validity indexes in the context of determining the suitable number of clusters in hierarchical clustering. A new Cluster Validity Index (CVI) is proposed that properly captures the implicit criteria used by experts when analyzing dendrograms. The proposal has been applied to a range of datasets and validated against expert ground truth, outperforming the state of the art while significantly reducing the computational cost.
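As a simple illustration of what a cluster validity index measures, this numpy sketch computes a Calinski–Harabasz-style ratio of between- to within-cluster dispersion; it is a generic stand-in, not the new CVI proposed in the paper.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz-style validity index: between-cluster dispersion
    over within-cluster dispersion (higher = better-separated clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        between += len(pts) * np.sum((centroid - overall) ** 2)
        within += np.sum((pts - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```

Scanning such an index over the partitions induced by different dendrogram cut heights is one way to automate the choice of cluster count that experts otherwise make visually.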


2020 ◽  
Author(s):  
Jason G. Kralj ◽  
Stephanie L. Servetas ◽  
Samuel P. Forry ◽  
Scott A. Jackson

Evaluating the performance of metagenomics analyses has proven a challenge, due in part to limited ground-truth standards, a broad application space, and numerous evaluation methods and metrics. Application of traditional clinical performance metrics (i.e., sensitivity, specificity, etc.) to taxonomic classifiers does not fit the “one-bug-one-test” paradigm. Ultimately, users need methods that evaluate fitness-for-purpose and identify their analyses’ strengths and weaknesses. Within a defined cohort, reporting performance metrics by taxon, rather than by sample, will clarify this evaluation. An estimated limit of detection, positive and negative control samples, and true positive and true negative results are necessary criteria for all investigated taxa. Use of summary metrics should be restricted to comparing results from similar cohorts and data, and should employ harmonic means and continuous products for each performance metric rather than arithmetic means. Such considerations will ensure meaningful comparisons and evaluation of fitness-for-purpose.
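A minimal Python sketch of the recommendations above: per-taxon sensitivity and specificity, summarized with a harmonic mean rather than an arithmetic mean. The confusion-matrix counts are made-up illustrative data.

```python
from statistics import harmonic_mean, mean

def per_taxon_metrics(tp, fp, fn, tn):
    """Clinical-style metrics computed per taxon rather than per sample."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# per-taxon sensitivities across a cohort (illustrative counts: tp, fp, fn, tn)
cohort = [(9, 1, 1, 89), (8, 2, 2, 88), (5, 0, 5, 90)]
sens = [per_taxon_metrics(*counts)[0] for counts in cohort]

# the harmonic mean penalizes a single weak taxon more strongly
summary_harmonic = harmonic_mean(sens)
summary_arithmetic = mean(sens)
```

Because the harmonic mean is dominated by its smallest term, one poorly detected taxon drags the summary down, which is exactly the sensitivity to weak spots the abstract argues for.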


Author(s):  
Yizhen Chen ◽  
Haifeng Hu

Most existing segmentation networks are built upon a “U-shaped” encoder–decoder structure, where the multi-level features extracted by the encoder are gradually aggregated by the decoder. Although this structure has been proven effective in improving segmentation performance, it has two main drawbacks. On the one hand, the introduction of low-level features brings a significant increase in computation without an obvious performance gain. On the other hand, general feature-aggregation strategies such as addition and concatenation fuse features without considering the usefulness of each feature vector, which mixes useful information with massive noise. In this article, we abandon the traditional “U-shaped” architecture and propose Y-Net, a dual-branch joint network for accurate semantic segmentation. Specifically, it aggregates only the high-level, low-resolution features and utilizes the global context guidance generated by the first branch to refine the second branch. The dual branches are effectively connected through a Semantic Enhancing Module, which can be regarded as a combination of spatial attention and channel attention. We also design a novel Channel-Selective Decoder (CSD) to adaptively integrate features from different receptive fields by assigning specific channel-wise weights, where the weights are input-dependent. Our Y-Net is capable of breaking through the limits of single-branch networks, attaining higher performance with less computational cost than the “U-shaped” structure. The proposed CSD can better integrate useful information and suppress interference noise. Comprehensive experiments are carried out on three public datasets to evaluate the effectiveness of our method. Our Y-Net achieves state-of-the-art performance on the PASCAL VOC 2012, PASCAL Person-Part, and ADE20K datasets without pre-training on extra datasets.
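As an illustration of input-dependent channel-wise weighting, the following numpy sketch gates each channel by weights derived from global average pooling. It is a generic attention-style sketch under assumed shapes, not the paper's actual CSD module.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def channel_select(features, gate):
    """Gate each channel with input-dependent weights derived from
    global average pooling (generic attention-style sketch)."""
    # features: (C, H, W); gate: (C, C) learned projection (assumed)
    pooled = features.mean(axis=(1, 2))       # global average pooling -> (C,)
    weights = softmax(gate @ pooled)          # input-dependent channel weights
    return features * weights[:, None, None]  # rescale each channel
```

Since the weights come from the pooled features themselves, the same module emphasizes different channels for different inputs, which is the “input-dependent” property the abstract highlights.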

