This paper describes a methodology for discovering and resolving protein names abbreviations from the full-text versions of scientific articles, implemented in the PRAISED framework with the ultimate purpose of building up a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, whose purpose is to identify all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which takes into account a number of syntactical and semantic criteria in order to match an abbreviation with its potential explanation; and iii) a dictionary-based protein name identification, which is meant to select only those abbreviations belonging to the protein science domain. A local copy of the UniProt database is used as a source repository for all the known proteins. © 2012, Jagiellonian University, Medical College, Kraków, Poland. All rights reserved.
- data mining