We prove a proposition that precisely answers the question
"What makes a good data augmentation (DA) in knowledge distillation (KD)?": a good DA should reduce the covariance of the teacher-student cross-entropy.
We present a practical metric that requires only the teacher to measure the goodness of a DA in KD: the
stddev of the teacher's mean probability (abbreviated as
T. stddev).
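A minimal sketch of how such a metric can be estimated with only a pretrained teacher is shown below. The exact averaging protocol here (averaging the teacher's probability of the ground-truth class over several stochastic DA draws per sample, then taking the stddev of these per-sample means over the dataset) is an assumption for illustration, and `teacher`, `dataset`, and `augment` are placeholder names, not identifiers from the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_stddev(teacher, dataset, augment, num_draws=10, device="cuda"):
    """Hedged sketch of the 'T. stddev' metric.

    Assumed protocol: for each raw training sample, average the teacher's
    probability of the ground-truth class over `num_draws` stochastic
    applications of the DA, then take the standard deviation of these
    per-sample means over the whole dataset.
    """
    teacher.eval()
    mean_probs = []
    for img, label in dataset:                               # raw (un-augmented) samples
        probs = []
        for _ in range(num_draws):
            x = augment(img).unsqueeze(0).to(device)         # one stochastic DA draw
            p = F.softmax(teacher(x), dim=1)[0, label]       # teacher prob. of true class
            probs.append(p.item())
        mean_probs.append(sum(probs) / num_draws)            # teacher's mean probability
    return torch.tensor(mean_probs).std().item()             # T. stddev
```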
Interestingly, T. stddev shows a strong positive correlation (p-values far below 5%) with the student's test loss (S. test loss) on CIFAR-100 and Tiny ImageNet (see the right figure above), despite knowing nothing about the student, implying that the goodness of a DA in KD is probably student-invariant.
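The correlation test itself is standard; a minimal sketch follows, where the metric values are purely hypothetical placeholders and both Pearson and Spearman statistics are shown since the choice of test is an assumption here.

```python
from scipy import stats

# Placeholder values: one (T. stddev, S. test loss) pair per DA scheme (hypothetical).
t_stddev   = [0.21, 0.19, 0.17, 0.15, 0.14]
s_testloss = [1.35, 1.30, 1.22, 1.18, 1.15]

r, p_pearson = stats.pearsonr(t_stddev, s_testloss)     # linear correlation + p-value
rho, p_spear = stats.spearmanr(t_stddev, s_testloss)    # rank correlation + p-value
print(f"Pearson r={r:.3f} (p={p_pearson:.4f}), Spearman rho={rho:.3f} (p={p_spear:.4f})")
```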
Based on this theory, we further propose an entropy-based data picking algorithm that boosts the prior SOTA DA scheme (CutMix) in KD, resulting in a new strong DA method,
CutMixPick.
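As a hedged sketch of the entropy-based picking idea (the candidate-pool construction, the keep-highest-teacher-entropy criterion, and `keep_ratio` are illustrative assumptions, not the exact CutMixPick recipe):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_by_teacher_entropy(teacher, candidates, keep_ratio=0.5):
    """Keep the CutMix-augmented candidates on which the teacher is most
    uncertain (highest predictive entropy).

    `candidates` is a tensor of augmented images, shape (N, C, H, W).
    """
    probs = F.softmax(teacher(candidates), dim=1)                   # (N, num_classes)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(1)   # per-sample entropy
    num_keep = max(1, int(keep_ratio * candidates.size(0)))
    idx = entropy.topk(num_keep).indices                            # highest-entropy samples
    return candidates[idx], idx
```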
Finally, we show how the theory can be utilized in practice to harvest considerable performance gains simply by using a stronger DA together with more training epochs.