TY - JOUR
T1 - Mitigating exposure bias in large language model distillation: an imitation learning approach
AU - Pozzi, Andrea
AU - Incremona, Alessandro
AU - Tessera, D.
AU - Toti, Daniele
PY - 2025
Y1 - 2025
AB - Knowledge distillation is recognized as a valuable model compression strategy that alleviates the computational burden of large language models while preserving performance. This strategy involves training a smaller model utilizing both real data and predictions from a more cumbersome model. Traditional distillation methods, however, are often compromised by exposure bias, which results from reliance on next-step prediction training loss. This bias emerges when models are tested in free-running mode, differing from their training regime and leading to a progressive drift in input distributions between testing and training phases. An analogous issue, known as ‘distributional shift’, has been effectively addressed in imitation learning through various methodologies. Therefore, this paper specifically tailors an imitation learning-based solution to a traditional knowledge distillation framework which inherently considers both real data and the teacher’s predictions as dual sources of expert demonstrations. The effectiveness of this approach is demonstrated over five different test datasets, where it outperforms traditional benchmarks across all evaluation metrics. Specifically, it achieves superior results in perplexity, multi-token generation, and G-Eval score, indicating improvements in both predictive accuracy and alignment with human judgment in text quality. These results underscore the potential of this approach to effectively address exposure bias in large language model distillation.
KW - Exposure bias
KW - Imitation learning
KW - Knowledge distillation
KW - Multi-token generation
UR - https://publicatt.unicatt.it/handle/10807/312759
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=105000708744&origin=inward
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105000708744&origin=inward
U2 - 10.1007/s00521-025-11162-0
DO - 10.1007/s00521-025-11162-0
M3 - Article
SN - 0941-0643
JO - Neural Computing and Applications
JF - Neural Computing and Applications
ER -