We present a proven proposition that precisely answers "What makes a good data augmentation (DA) in knowledge distillation (KD)?": a good DA should reduce the covariance of the teacher-student cross-entropy.
We present a practical metric that needs only the teacher to measure the goodness of a DA in KD: the standard deviation of the teacher's mean probability (shortened as T. stddev).
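As a rough illustration of such a teacher-only metric, here is a minimal sketch. The exact definition of T. stddev is an assumption here: we read it as the standard deviation, across augmented samples, of the teacher's predicted probability for the ground-truth class. The function name and signature are hypothetical, not from the paper.

```python
import numpy as np

def t_stddev(teacher_probs, labels):
    """Hypothetical sketch of a T. stddev-style metric.

    teacher_probs: (N, C) array of teacher softmax outputs on augmented data.
    labels: (N,) int array of ground-truth class indices.
    Returns the stddev, over samples, of the teacher's probability
    assigned to the true class (assumed reading of the metric).
    """
    p_true = teacher_probs[np.arange(len(labels)), labels]
    return float(np.std(p_true))
```

Under this reading, a DA that yields a lower value would be predicted to give a lower student test loss, per the correlation reported above.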
Interestingly, T. stddev shows a strong positive correlation (the p-values are far below 5%) with the student's test loss (S. test loss) on CIFAR-100 and Tiny ImageNet (see the right figure above), despite knowing nothing about the student, implying that the goodness of a DA in KD is probably student-invariant.
Based on this theory, we further propose an entropy-based data-picking algorithm that further boosts the prior SOTA DA scheme (CutMix) in KD, resulting in a new strong DA method, CutMixPick.
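The core idea of entropy-based picking can be sketched as follows: among augmented (e.g. CutMix) samples, keep those on which the teacher's predictive entropy is highest. This is a hedged illustration, not the paper's implementation; `keep_ratio` and the function name are illustrative assumptions.

```python
import numpy as np

def pick_by_entropy(teacher_probs, keep_ratio=0.5):
    """Sketch of entropy-based data picking (assumed reading).

    teacher_probs: (N, C) teacher softmax outputs on augmented samples.
    keep_ratio: fraction of samples to keep (illustrative parameter).
    Returns indices of the kept samples, highest teacher entropy first.
    """
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(teacher_probs * np.log(teacher_probs + eps), axis=1)
    k = max(1, int(len(teacher_probs) * keep_ratio))
    return np.argsort(entropy)[::-1][:k]
```

The design intuition, per the proposition above, is that higher-entropy teacher outputs correspond to augmented samples that better reduce the teacher-student cross-entropy covariance.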
Finally, we show how the theory can be utilized in practice to harvest considerable performance gains simply by using a stronger DA with prolonged training epochs.