Evaluating historical word embeddings: strategies, challenges and pitfalls

Oksana Dereza, Theodorus Fransen, John P. McCrae

Research output: Chapter in book, conference contribution


When it comes to the quantitative evaluation of word embeddings, there are two main strategies: extrinsic, i.e. using pre-trained embeddings as input vectors in a downstream ML task such as language modelling, and intrinsic, i.e. through analogy and similarity tasks that require special datasets (Bakarov, 2018).

Extrinsic evaluation

Language modelling seems to be the easiest way to evaluate historical word embeddings, since it is language-independent, scalable and does not require dataset creation. Hypothetically, using pre-trained embeddings should lower the perplexity of a language model, even if these embeddings were trained on a different period of the same language. However, language modelling, like the majority of modern NLP tasks, is not very relevant to historical linguistics, so we might want to find a better downstream task or turn to intrinsic evaluation.

Intrinsic evaluation

There are two major tasks used for intrinsic evaluation of word embeddings: similarity and analogy. The similarity task consists in comparing the similarity scores that an embedding model assigns to pairs of words with scores based on experts’ judgment. We did not explore this option, because it requires too much manual work by definition. The analogy task simply asks an embedding model “What is to a′ as b is to b′?” and expects a as the answer. Analogy datasets can be created automatically or semi-automatically if a comprehensive historical dictionary of the language in question, or a WordNet, exists in machine-readable format. Traditionally, analogy datasets are based on pairwise semantic proportion, and therefore every question has a single correct answer. Given the high level of variation in historical languages, such a strict definition of a correct answer seems unjustified.
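The analogy question described above is commonly answered with a vector-offset method often called 3CosAdd: the candidate closest to b − b′ + a′ is returned as a. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `answer_analogy` and `is_correct`, the toy three-dimensional vectors and the toy vocabulary (including the invented spelling variant "quene") are all hypothetical. It also shows how a question can be scored against a set of acceptable answers rather than a single one.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

def answer_analogy(vectors, a_prime, b, b_prime, topn=1):
    """Rank candidates for 'What is to a_prime as b is to b_prime?'.

    Uses the vector offset b - b_prime + a_prime (3CosAdd) and excludes
    the three question words from the candidate list.
    """
    target = unit(unit(vectors[b]) - unit(vectors[b_prime]) + unit(vectors[a_prime]))
    scores = {
        word: float(unit(vec) @ target)
        for word, vec in vectors.items()
        if word not in {a_prime, b, b_prime}
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

def is_correct(predictions, acceptable):
    """A question counts as solved if any prediction is an acceptable answer."""
    return any(p in acceptable for p in predictions)

# Toy vectors, invented purely for illustration (not real embeddings).
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "quene": np.array([0.9, 0.1, 1.0]),  # hypothetical spelling variant
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.2, 0.9, 0.1]),
}

# "What is to king as woman is to man?"
predictions = answer_analogy(vectors, "king", "woman", "man", topn=2)
print(predictions, is_correct(predictions, {"queen", "quene"}))
```

Accepting any member of a set of variants is one simple way to relax the single-answer assumption; how strictly to define that set remains a linguistic decision.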
Therefore, in our Early Irish analogy dataset we follow the authors of BATS (Gladkova et al., 2016), providing several correct answers for each analogy question and evaluating performance with set-based metrics, such as an average of vector offsets over multiple pairs (3CosAvg). Our dataset consists of four parts: the morphological variation and spelling variation subsets were automatically extracted from eDIL (eDIL, 2019), while the synonym and antonym subsets are translations of the corresponding BATS parts, proofread by four expert evaluators.

However, the scores that Early Irish embedding models achieved on the analogy dataset were low enough to be statistically insignificant. Such a failure may be a result of the following problems:

  • The highest inter-annotator agreement score (Cohen’s kappa) between experts was 0.339, which reflects the level of disagreement in the field of historical Irish linguistics. The disagreement concerns such fundamental questions as “What is a word? Where does it begin and end? What is a normalised spelling of a word at a particular stage of the language history?”, which were discussed in Doyle et al. (2018) and Doyle et al. (2019) with regard to tokenisation. Arguably, the same may hold for historical linguistics in general.
  • There is a lack of standardisation across different resources for the same historical language. For example, ~65% of the morphological and spelling variation subsets, retrieved from eDIL, were not present in the whole Early Irish corpus retrieved from CELT (CELT, 1997), on which the biggest model was trained. As for the synonym and antonym subsets, ~30% are missing from the corpus. Although our embedding models used subword information and were able to handle unknown words, such a discrepancy between the corpus on which they were trained and the historical dictionary that served as the source for the evaluation dataset seriously affected the performance.
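The mismatch between an evaluation dataset and a training corpus can be measured before any model is evaluated. The sketch below is illustrative only: the function name `coverage` and the placeholder tokens are assumptions, and it presumes both resources have already been tokenised and normalised in the same way, which is itself a contested step for historical texts.

```python
def coverage(dataset_words, corpus_vocab):
    """Return the share of distinct dataset words found in the corpus
    vocabulary, together with the sorted list of missing words."""
    words = set(dataset_words)
    if not words:
        raise ValueError("dataset_words is empty")
    missing = words - set(corpus_vocab)
    covered = 1 - len(missing) / len(words)
    return covered, sorted(missing)

# Placeholder tokens standing in for dictionary forms and corpus types.
dataset_words = ["form_a", "form_b", "form_c", "form_d"]
corpus_vocab = {"form_a", "form_x", "form_y"}

covered, missing = coverage(dataset_words, corpus_vocab)
print(f"covered: {covered:.0%}, missing: {missing}")
```

A low coverage figure signals that analogy scores will partly reflect how out-of-vocabulary forms are handled rather than the quality of the learned semantic space.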
This discrepancy between corpus and dictionary originates from the different linguistic views and editorial policies adopted by text editors, publishers and resource developers over time.

References

Bakarov, A. (2018). A
Original language: English
Title of host publication: Workshop: Computational models of diachronic language change
Number of pages: 2
Publication status: Published - 2023
Event: Workshop 3: Computational models of diachronic language change, 26th International Conference on Historical Linguistics (ICHL26) - Heidelberg (Germany)
Duration: 10 Jul 2023 to 11 Jul 2023




  • Early Irish
  • historical word embeddings


