Abstract
When it comes to the quantitative evaluation of word embeddings, there are two main strategies:
extrinsic, i.e. using pre-trained embeddings as input vectors in a downstream ML task, such as language
modelling, and intrinsic, i.e. through analogy and similarity tasks that require special datasets (Bakarov,
2018).
Extrinsic evaluation
Language modelling seems to be the easiest way to evaluate historical word embeddings, since it is language-independent, scalable and does not require dataset creation. Hypothetically, using pre-trained embeddings should lower the perplexity of a language model, even if the embeddings were trained on a different period of the same language. However, language modelling, like the majority of modern NLP tasks, is not particularly relevant to historical linguistics, so we might want to find a better downstream task or turn to intrinsic evaluation.
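Perplexity, the quantity mentioned above, is simply the exponential of the negative mean per-token log-probability. A minimal sketch (the probabilities are invented for illustration, not taken from any actual model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example: four tokens, each assigned probability 0.25 by the model,
# which corresponds to a uniform 4-way choice, i.e. perplexity 4.
logprobs = [math.log(0.25)] * 4
print(perplexity(logprobs))
```

A model that benefits from pre-trained embeddings would assign higher probabilities to the observed tokens and thus score a lower perplexity.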
Intrinsic evaluation
There are two major tasks used for intrinsic evaluation of word embeddings: similarity and analogy. The similarity task consists in comparing the similarity scores that an embedding model assigns to pairs of words against scores derived from expert judgment. We did not explore this option, because it requires too much manual work by definition. The analogy task asks an embedding model "What is to a′ as b is to b′?" and expects a as the answer. Analogy datasets can be created automatically or semi-automatically if a comprehensive historical dictionary of the language in question exists in machine-readable format, or if there is a WordNet.
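Such an analogy query is typically answered with the standard vector-offset (3CosAdd) method. A minimal sketch with hand-made toy vectors, constructed so that the classic king/queen analogy works out exactly (the words and vectors are invented for illustration, not drawn from our models):

```python
import numpy as np

# Toy 2-dimensional embeddings, constructed so the analogy resolves cleanly.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
    "apple": np.array([0.7, 0.3]),
    "ship":  np.array([0.5, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, a_prime, b, emb):
    """Answer 'a is to a_prime as b is to ?' by maximising cos(x, a' - a + b)."""
    target = emb[a_prime] - emb[a] + emb[b]
    candidates = [w for w in emb if w not in {a, a_prime, b}]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "woman", "king", emb))  # → queen
```

The query words themselves are excluded from the candidate set, as is conventional in analogy evaluation.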
Traditionally, analogy datasets are based on pairwise semantic proportions, and therefore every question has a single correct answer. Given the high level of variation in historical languages, such a strict definition of a correct answer seems unjustified. Therefore, in our Early Irish analogy dataset we follow the authors of BATS (Gladkova et al., 2016) in providing several correct answers for each analogy question and evaluating performance with set-based metrics, such as the average of the vector offset over multiple pairs (3CosAvg).
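A minimal sketch of 3CosAvg combined with a set-based notion of correctness; all words and vectors below are invented toy data standing in for variant pairs, not our actual models or dataset:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def three_cos_avg(pairs, b, emb):
    """Predict b' from the offset averaged over all (a, a') pairs."""
    offset = np.mean([emb[a2] - emb[a1] for a1, a2 in pairs], axis=0)
    target = emb[b] + offset
    exclude = {b} | {w for pair in pairs for w in pair}
    candidates = [w for w in emb if w not in exclude]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy vectors: each "_v" word is a variant lying at roughly the same offset.
emb = {
    "w1": np.array([1.0, 0.0]), "w1_v": np.array([1.0, 1.0]),
    "w2": np.array([2.0, 0.0]), "w2_v": np.array([2.0, 1.0]),
    "q":    np.array([3.0, 0.0]),
    "q_v1": np.array([3.0, 1.0]), "q_v2": np.array([3.0, 1.1]),
}
pairs = [("w1", "w1_v"), ("w2", "w2_v")]
prediction = three_cos_avg(pairs, "q", emb)
# Set-based evaluation: the prediction counts as correct
# if it falls anywhere in the answer set.
correct = prediction in {"q_v1", "q_v2"}
```

Averaging the offset over all pairs in a category, rather than using a single pair, makes the query less sensitive to any one noisy pair, which matters for highly variable historical data.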
Our dataset consists of four parts: the morphological variation and spelling variation subsets were automatically extracted from eDIL (eDIL, 2019), while the synonym and antonym subsets are translations of the corresponding BATS parts, proofread by four expert evaluators. However, the scores that Early Irish embedding models achieved on the analogy dataset were so low as to be statistically insignificant. This failure may be a result of the following problems:
The highest inter-annotator agreement score (Cohen's kappa) between experts was 0.339, which reflects the level of disagreement in the field of historical Irish linguistics. The disagreement concerns such fundamental questions as "What is a word? Where does it begin and end? What is the normalised spelling of a word at a particular stage of the language's history?", which were discussed with regard to tokenisation in (Doyle et al., 2018) and (Doyle et al., 2019). Arguably, the same holds for historical linguistics in general.
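For reference, Cohen's kappa corrects raw agreement for chance agreement, κ = (p_o − p_e) / (1 − p_e). A minimal sketch for two annotators (the labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators coincide.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement: product of the two annotators' marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Toy annotations: two experts judging four answers as acceptable (1) or not (0).
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # → 0.5
```

Values below roughly 0.4 are conventionally read as no better than fair agreement, which puts the observed 0.339 in context.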
There is a lack of standardisation across different resources for the same historical language. For example, ~65% of the items in the morphological and spelling variation subsets, retrieved from eDIL, were not present in the entire Early Irish corpus retrieved from CELT (CELT, 1997), on which the biggest model was trained. As for the synonym and antonym subsets, ~30% of their items are missing from the corpus. Although our embedding models used subword information and were able to handle unknown words, such a discrepancy between the corpus on which they were trained and the historical dictionary that served as the source of the evaluation dataset seriously affected performance. This discrepancy originates in the different linguistic views and editorial policies adopted by different text editors, publishers and resource developers over time.
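Coverage figures of this kind reduce to simple set arithmetic over vocabularies. A minimal sketch (the word lists are toy stand-ins chosen for illustration, not actual eDIL or CELT data):

```python
# Toy stand-ins for the corpus vocabulary and the evaluation-dataset items.
corpus_vocab = {"rí", "fer", "ben", "mac"}
dataset_items = {"rí", "ríg", "fer", "fir", "ingen"}

# Items in the evaluation dataset that never occur in the training corpus.
missing = dataset_items - corpus_vocab
oov_rate = len(missing) / len(dataset_items)
print(f"{oov_rate:.0%} of dataset items absent from the corpus")
```

In practice the comparison would be run over the tokenised CELT corpus and the headword or variant lists extracted from eDIL, possibly after normalising spelling on both sides.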
References
Bakarov, A. (2018). A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536.
| Original language | English |
|---|---|
| Host publication title | Workshop: Computational models of diachronic language change |
| Pages | 21-22 |
| Number of pages | 2 |
| Publication status | Published - 2023 |
| Event | Workshop 3: Computational models of diachronic language change, 26th International Conference on Historical Linguistics (ICHL26) - Heidelberg (Germany). Duration: 10 Jul 2023 → 11 Jul 2023 |
Keywords
- Early Irish
- historical word embeddings