SMKD: Selective Mutual Knowledge Distillation

Ziyun Li et al. Proc Int Jt Conf Neural Netw. 2023 Jun 18;2023:1-8. doi: 10.1109/IJCNN54540.2023.10191991.

Abstract

Mutual knowledge distillation (MKD) is a technique for transferring knowledge between multiple models in a collaborative manner. However, not all knowledge is accurate or reliable, particularly under challenging conditions such as label noise, which can lead models to memorize undesired information. This problem can be addressed by improving the reliability of the knowledge source and by selecting only reliable knowledge for distillation. While making a model more reliable is a widely studied topic, selective MKD has received little attention. To address this, we propose a new framework called selective mutual knowledge distillation (SMKD). The key component of SMKD is a generic knowledge selection formulation, which allows either static or progressive selection thresholds. SMKD also covers two special cases, using no knowledge and using all knowledge, resulting in a unified MKD framework. We present extensive experimental results that demonstrate the effectiveness of SMKD and justify its design.
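
As a rough, non-authoritative sketch of the idea (the entropy gate, the threshold schedule, and the loss weighting below are assumptions for exposition, not the paper's exact formulation), selective mutual distillation can be implemented as a per-sample mask on the peer model's predictions:

import torch.nn.functional as F

def entropy(p, eps=1e-8):
    # Shannon entropy of each row of a batch of probability vectors.
    return -(p * (p + eps).log()).sum(dim=1)

def smkd_loss(logits_self, probs_peer, targets, threshold):
    # Supervised cross-entropy on the (possibly noisy) labels.
    ce = F.cross_entropy(logits_self, targets)
    # Keep only peer predictions that are confident enough:
    # low entropy -> treated as reliable knowledge -> distilled.
    keep = (entropy(probs_peer) < threshold).float()
    log_p_self = F.log_softmax(logits_self, dim=1)
    per_sample = -(probs_peer * log_p_self).sum(dim=1)  # H(q, p) per sample
    distill = (keep * per_sample).sum() / keep.sum().clamp(min=1.0)
    return ce + distill

# threshold = 0         -> no peer knowledge is used (independent training)
# threshold = +infinity -> all peer knowledge is used (conventional MKD)
# A progressive schedule for the threshold interpolates between these
# two special cases during training.

In such a setup, each model would compute this loss using the other model's softened predictions as probs_peer, and the two models would be updated jointly; Fig. 1(b) applies the same gating idea with refined labels q̃ in place of the raw peer predictions used here.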


Figures

Fig. 1. Comparison of conventional MKD and our SMKD. Dotted frames represent components from model A and solid frames represent components from model B. pA and pB are the predictions from model A and model B, respectively. In (b), q̃A and q̃B are the labels refined by a self-distillation method, and χ is the threshold that decides whether a prediction is confident enough. H(p) denotes the entropy of p, and H(q, p) is the cross-entropy loss between q and p. One way to write the resulting gated loss in this notation is sketched after the figure captions.
Fig. 2. Knowledge communication frequency, measured as the number of distilled training labels passed from A to B plus the number passed from B to A. All experiments are done on CIFAR-100 with η = 2 under 40% symmetric noise. CIFAR-100 has 50,000 training examples in total, and most of the training samples are exploited late in the training process.
Fig. 3. Under different noise rates. We fix η = 2.
Fig. 4. Under different η. Symmetric label noise rate r = 60%.
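
Using the notation of Fig. 1, one plausible reading of the gated objective for model A is the sketch below; the exact weighting and the use of q̃B as the distillation target are assumptions rather than the paper's verbatim loss:

    \mathcal{L}_A = H(y, p_A) + \mathbb{1}\!\left[ H(p_B) < \chi \right] \, H(\tilde{q}_B, p_A)

where y is the (possibly noisy) training label; the loss for model B is obtained by swapping the roles of A and B.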

