Mitigating exposure bias in large language model distillation: an imitation learning approach

Andrea Pozzi*, Alessandro Incremona, D. Tessera, Daniele Toti

*Corresponding author for this work

Research output: Contribution to journal › Article

Abstract

Knowledge distillation is recognized as a valuable model compression strategy that alleviates the computational burden of large language models while preserving performance. This strategy involves training a smaller model utilizing both real data and predictions from a more cumbersome model. Traditional distillation methods, however, are often compromised by exposure bias, which results from reliance on next-step prediction training loss. This bias emerges when models are tested in free-running mode, differing from their training regime and leading to a progressive drift in input distributions between testing and training phases. An analogous issue, known as 'distributional shift', has been effectively addressed in imitation learning through various methodologies. Therefore, this paper specifically tailors an imitation learning-based solution to a traditional knowledge distillation framework which inherently considers both real data and the teacher's predictions as dual sources of expert demonstrations. The effectiveness of this approach is demonstrated over five different test datasets, where it outperforms traditional benchmarks across all evaluation metrics. Specifically, it achieves superior results in perplexity, multi-token generation, and G-Eval score, indicating improvements in both predictive accuracy and alignment with human judgment in text quality. These results underscore the potential of this approach to effectively address exposure bias in large language model distillation.
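The imitation-learning remedy the abstract alludes to can be illustrated with a minimal DAgger-style loop on a toy sequence task: the student is rolled out in free-running mode, the states it actually visits are relabeled by an expert (standing in for the teacher model and ground-truth data), and the student is retrained on the aggregated dataset. This is a hedged sketch of the general technique, not the paper's implementation; all names and the toy task are illustrative.

```python
# Toy DAgger-style loop illustrating how imitation learning counters
# exposure bias: the student is retrained on states visited during its
# own free-running rollouts, relabeled by an "expert" policy.
# All names and the toy task are illustrative assumptions.

def dagger_rounds(expert, train_student, initial_data, rollout, n_rounds=3):
    """Aggregate expert labels on student-visited states, then retrain."""
    dataset = list(initial_data)            # expert demonstrations
    student = train_student(dataset)
    for _ in range(n_rounds):
        states = rollout(student)           # free-running: student's own prefixes
        dataset += [(s, expert(s)) for s in states]  # expert relabels them
        student = train_student(dataset)    # retrain on the aggregated data
    return student

# Toy task: generate the string "abab" one character at a time.
def expert(prefix):
    return "a" if len(prefix) % 2 == 0 else "b"  # ground-truth next char

def train_student(dataset):
    table = dict(dataset)                   # memorize (state -> next char)
    return lambda s: table.get(s, "a")      # unseen states fall back to "a"

def rollout(student, horizon=4):
    prefix, states = "", []
    for _ in range(horizon):
        states.append(prefix)               # record each visited state
        prefix += student(prefix)           # student feeds on its own output
    return states

# Trained only on the first step, the initial student drifts ("aaaa");
# after the DAgger rounds it recovers the full target sequence.
student = dagger_rounds(expert, train_student, [("", "a")], rollout)
generated = ""
for _ in range(4):
    generated += student(generated)
```

The key design point mirrors the abstract: labels come from the expert, but the *input distribution* comes from the student's own rollouts, closing the train/test mismatch that causes exposure bias.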
Original language: English
Pages (from-to): N/A-N/A
Journal: Neural Computing and Applications
Issue number: N/A
DOI
Publication status: Published - 2025

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Keywords

  • Exposure bias
  • Imitation learning
  • Knowledge distillation
  • Multi-token generation
