Sensors (Basel). 2021 Apr 15;21(8):2792. doi: 10.3390/s21082792.

Memory-Replay Knowledge Distillation


Jiyue Wang et al. Sensors (Basel).

Abstract

Knowledge Distillation (KD), which transfers the knowledge from a teacher to a student network by penalizing their Kullback-Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, while self-KD distills the model's own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by the identical sample under different augmentations. However, these self-KD methods have limitations for widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), that uses historical models as teachers. First, we propose a novel self-KD training method that penalizes the KD loss between the current model's output distributions and its backup outputs along the training trajectory. This strategy regularizes the model with its historical output distribution space to stabilize learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers' outputs for better guidance. Finally, to ensure the teacher outputs rank the ground-truth class first, we correct the teacher logit output with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing self-KD methods that depend on various external knowledge, the effectiveness of MrKD sheds light on the usually discarded historical models along the training trajectory.
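The core KD objective described above, penalizing the KL divergence between the current model's temperature-softened outputs and those of a historical backup model, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the temperature value are our own choices.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is conventional in knowledge distillation.
    In MrKD, the 'teacher' logits would come from a backup of the
    model saved earlier on the training trajectory."""
    p = softmax(teacher_logits, T)  # historical backup (teacher)
    q = softmax(student_logits, T)  # current model (student)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In training, this term would be added to the usual cross-entropy loss with a weighting coefficient; when student and teacher agree exactly, the loss is zero, and it grows as the current model drifts from its historical output distribution.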

Keywords: Deep Neural Network; Fully Connected Network; Knowledge Adjustment; audio classification; image classification; self-knowledge distillation; training trajectory.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The framework of our proposed memory replay Knowledge Distillation method with Fully Connected Network and Knowledge Adjustment.
Figure 2
Simplified graphical illustration of different self-knowledge distillation methods. (a) Class-wise self-Knowledge Distillation [19]; (b) self-Knowledge Distillation [21]; (c) memory replay Knowledge Distillation.
Figure 3
The framework of MrKD-plain, without the Fully Connected Network (FCN) ensemble and Knowledge Adjustment (KA).
Figure 4
The framework of the Fully Connected Network.
Figure 5
Knowledge Adjustment of a wrong probability distribution offered by an imperfect teacher. The distribution is from a sample of the CIFAR-10 training dataset whose ground-truth label is ‘ship’, but the teacher’s prediction is ‘car’. Their values are exchanged.
Figure 6
Samples from the CIFAR-100 dataset. Upper: original images. Lower: augmented images. The labels from left to right: leopard, train, fox, truck, snail, wolf, castle, and cockroach.
Figure 7
Performance on CIFAR-100 with different update frequency κ.
Figure 8
Performance on CIFAR-100 with different copy amount n.
Figure 9
Performance of MrKD on CIFAR-100 with FCNs of different depths.
Figure 10
Samples from the CINIC-10 dataset. Upper: inherited from the CIFAR-10 dataset. Lower: extended from the ImageNet dataset. The labels from left to right: plane, ship, bird, dog, car, cat, horse, and deer.
Figure 11
The 10-s audio clips from DCASE’18 ASC [30]. Left: the raw waveform data. Right: the corresponding log Mel spectrogram. The acoustic scenes from top to bottom: airport, park, shopping mall, bus.
Figure 12
The 10-s audio clips from DCASE’20 Low Complexity ASC [31]. Left: the raw waveform data. Right: the corresponding log Mel spectrogram. The acoustic scenes from top to bottom: transportation (traveling by tram), indoor (metro station), outdoor (park).
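The Knowledge Adjustment step illustrated in Figure 5 exchanges the teacher's value for its wrongly predicted class with the value for the ground-truth class, so the corrected distribution ranks the true class first. A minimal sketch of that exchange, operating on probabilities for simplicity (the function name is our own; the paper applies the correction to the teacher's logit output):

```python
def knowledge_adjustment(teacher_probs, true_label):
    """If the teacher's top prediction differs from the ground-truth
    label, swap the two entries so the corrected distribution puts
    its largest value on the true class (cf. Figure 5)."""
    probs = list(teacher_probs)
    pred = max(range(len(probs)), key=probs.__getitem__)
    if pred != true_label:
        probs[pred], probs[true_label] = probs[true_label], probs[pred]
    return probs
```

For the Figure 5 example, a teacher that puts most mass on ‘car’ for a ‘ship’ sample would, after adjustment, put that mass on ‘ship’ instead; a teacher that is already correct is left unchanged.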

References

    1. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 770–778.
    2. Huang G., Liu Z., Pleiss G., Van Der Maaten L., Weinberger K. Convolutional Networks with Dense Connectivity. IEEE Trans. Pattern Anal. Mach. Intell. 2019. doi: 10.1109/TPAMI.2019.2918284.
    3. Chen Y., Li J., Xiao H., Jin X., Yan S., Feng J. Dual path networks. Adv. Neural Inf. Process. Syst. 2017:4467–4475.
    4. Howard A.G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M., Adam H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
    5. Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L.C. MobileNetV2: Inverted residuals and linear bottlenecks; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–23 June 2018; pp. 4510–4520.