Memory-Replay Knowledge Distillation
- PMID: 33921068
- PMCID: PMC8071405
- DOI: 10.3390/s21082792
Abstract
Knowledge Distillation (KD), which transfers knowledge from a teacher to a student network by penalizing their Kullback-Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, while self-KD distills a network's own knowledge to achieve better performance. In self-KD, the role of the teacher is usually played by multi-branch peers or by the identical sample under different augmentation. However, both of these self-KD approaches have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), that uses historical models as teachers. First, we propose a novel self-KD training method that penalizes the KD loss between the current model's output distributions and the outputs of its backups along the training trajectory. This strategy regularizes the model with its historical output distribution space to stabilize learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers' outputs for better guidance. Finally, to ensure that the teacher outputs point to the ground-truth class, we correct the teacher logit outputs with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models along the training trajectory, which are usually discarded.
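The abstract gives no formulas, so the following is only a minimal Python sketch of how the described loss could be assembled, under several assumptions that are not taken from the paper. The function names (`softmax`, `kl_divergence`, `knowledge_adjustment`, `mrkd_loss`) are hypothetical. The temperature `temperature`, the weight `alpha`, and the `temperature ** 2` scaling follow common KD practice and are assumed. Knowledge Adjustment is assumed to swap the teacher's wrongly predicted top logit with the ground-truth logit. The FCN ensemble is replaced here by a plain average over the backup teachers.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def knowledge_adjustment(teacher_logits, label):
    """Assumed KA rule: if the teacher's top class is wrong, swap that
    logit with the ground-truth logit so the true class ranks first."""
    adjusted = list(teacher_logits)
    top = max(range(len(adjusted)), key=adjusted.__getitem__)
    if top != label:
        adjusted[top], adjusted[label] = adjusted[label], adjusted[top]
    return adjusted

def mrkd_loss(student_logits, backup_logits, label, temperature=4.0, alpha=0.5):
    """Cross-entropy plus the averaged KD loss against historical backups.

    backup_logits holds one logit vector per stored historical model
    for the same training sample (hypothetical data layout).
    """
    cross_entropy = -math.log(softmax(student_logits)[label])
    student_soft = softmax(student_logits, temperature)
    kd = 0.0
    for teacher_logits in backup_logits:
        corrected = knowledge_adjustment(teacher_logits, label)
        kd += kl_divergence(softmax(corrected, temperature), student_soft)
    kd /= len(backup_logits)  # plain average stands in for the FCN ensemble
    return cross_entropy + alpha * temperature ** 2 * kd
```

In this sketch, a backup whose logits match the current model's contributes no KD penalty, so the loss reduces to the cross-entropy term; backups that disagree with the current outputs pull the model back toward its historical output distribution.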
Keywords: Deep Neural Network; Fully Connected Network; Knowledge Adjustment; audio classification; image classification; self-knowledge distillation; training trajectory.
Conflict of interest statement
The authors declare no conflict of interest.
