Towards a Historical Text Re-use Detection

Marco Büchler, Philip R. Burns, Martin Müller, Emily Franzini, Greta Franzini

Research output: Contribution to book (Chapter)

Abstract

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus complicating text re-use detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning technique to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when considering a verse (such as Genesis, Chapter 1, Verse 1) that is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. However, as regards historical languages, there is a lack of language resources (for example, WordNet) that makes non-literal text re-use and paraphrases much more difficult to identify. These differences are present in the form of replacements, corrections, varying writing styles, etc. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
Original language: English
Title of host publication: Text Mining, Theory and Applications of Natural Language Processing
Editors: Alexander Mehler, Chris Biemann
Pages: 221-238
Number of pages: 18
DOI
Publication status: Published - 2014

Keywords

  • Brute Force Method
  • Digital Library
  • Locality Sensitive Hashing
  • Paradigmatic Relation
  • String Similarity
