Abstract
This paper describes a methodology for discovering and resolving protein name abbreviations in the full-text versions of scientific articles, implemented in the PRAISED framework with the ultimate goal of building a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, whose purpose is to identify all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which takes into account a number of syntactic and semantic criteria in order to match an abbreviation with its potential explanation; and iii) a dictionary-based protein name identification phase, which selects only those abbreviations belonging to the protein science domain. A local copy of the UniProt database is used as the source repository for all known proteins. © 2012, Jagiellonian University, Medical College, Kraków, Poland. All rights reserved.
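The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the PRAISED implementation: the parenthesis-based candidate pattern, the first-letter matching rule, the function names, and the small `KNOWN_PROTEINS` set (a hypothetical stand-in for a local UniProt copy) are all assumptions made for illustration, and the actual metrics and criteria of the framework are not given in this record.

```python
import re

# Hypothetical stand-in for a local UniProt copy: a tiny set of lowercased names.
KNOWN_PROTEINS = {"epidermal growth factor receptor", "tumor necrosis factor"}

# Step i (assumed heuristic): a parenthesised token of 2-10 characters.
CANDIDATE = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9-]{1,9})\)")


def find_candidates(text):
    """Return (abbreviation, offset of the opening parenthesis) pairs."""
    return [
        (m.group(1), m.start())
        for m in CANDIDATE.finditer(text)
        if any(c.isupper() for c in m.group(1))  # require an uppercase letter
    ]


def resolve(text, abbr, paren_pos):
    """Step ii (assumed heuristic): accept the words just before the
    parenthesis if their initials spell the abbreviation."""
    words = re.findall(r"[A-Za-z0-9]+", text[:paren_pos])
    letters = [c for c in abbr if c.isalnum()]
    window = words[-len(letters):]
    if len(window) == len(letters) and all(
        w[0].lower() == c.lower() for w, c in zip(window, letters)
    ):
        return " ".join(window)
    return None


def is_protein(long_form):
    """Step iii: dictionary lookup against the stand-in protein list."""
    return long_form.lower() in KNOWN_PROTEINS


def extract_protein_abbreviations(text):
    """Run the three steps and return {abbreviation: long form}."""
    result = {}
    for abbr, pos in find_candidates(text):
        long_form = resolve(text, abbr, pos)
        if long_form and is_protein(long_form):
            result[abbr] = long_form
    return result
```

Applied to a sentence such as "The epidermal growth factor receptor (EGFR) was measured by polymerase chain reaction (PCR).", both abbreviations are identified and resolved, but only EGFR passes the dictionary filter, which mirrors the role of the third step in restricting the output to the protein domain.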
Original language | English |
---|---|
Pages (from-to) | 13-51 |
Number of pages | 39 |
Journal | Bio-Algorithms and Med-Systems |
Volume | 8 |
DOI | |
Publication status | Published - 2012 |
Keywords
- abbreviations
- data mining
- extraction
- proteins
- resolution