A Multi-Level Boundary Classification Approach to Information Extraction

Finn, A. (2006). A Multi-Level Boundary Classification Approach to Information Extraction. PhD thesis, University College Dublin.

Abstract
Information Extraction (IE) is the process of identifying a set of pre-defined relevant items in text documents. We investigate the application of Machine Learning classification techniques to the problem of Information Extraction. In particular, we use Support Vector Machines and several different feature-sets to build a set of classifiers for IE. We show that this approach is competitive with current state-of-the-art Information Extraction algorithms based on specialized learning algorithms. We investigate the different components of our IE system, such as learning algorithm, feature-set and instance selection, and compare how much each component contributes to performance. We also introduce a new multi-level classification technique for improving the recall of IE systems. We show that this technique can significantly improve the performance of our IE system, yielding both high precision and high recall. Our system (ELIE) is an adaptive Information Extraction algorithm that uses a two-level boundary classification approach to learning. ELIE first classifies every document position as the start of a fragment to be extracted, the end of a fragment, or neither. This first level of extraction typically has high precision but mediocre recall. To increase recall, we employ a second level of classification. Positions near those extracted at the first level are classified by a second pair of classifiers that are biased for high recall. For example, the positions "downstream" from each extracted start position are classified in order to find the end of the given fragment. Our results on several benchmark corpora indicate that ELIE often outperforms state-of-the-art competitors.
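
The two-level idea described in the abstract can be illustrated with a minimal sketch. This is not ELIE's implementation: the abstract says the classifiers are Support Vector Machines trained on several feature-sets, whereas the four classifiers below are hand-written rules standing in for trained models. The function names, the `window` size, the toy sentence, and the rule that pairs each start with its nearest following end are all assumptions made for illustration.

```python
from typing import List, Tuple

Tokens = List[str]


def is_name_like(tok: str) -> bool:
    """Alphabetic token beginning with an upper-case letter."""
    return tok.isalpha() and tok[0].isupper()


# Level 1: strict rules (stand-ins for high-precision classifiers).
def l1_start(tokens: Tokens, i: int) -> bool:
    return tokens[i] == "Dr."


def l1_end(tokens: Tokens, i: int) -> bool:
    return (is_name_like(tokens[i]) and i + 1 < len(tokens)
            and tokens[i + 1] == "will")


# Level 2: loose rules (stand-ins for high-recall classifiers).
def l2_start(tokens: Tokens, i: int) -> bool:
    return tokens[i][0].isupper()


def l2_end(tokens: Tokens, i: int) -> bool:
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return is_name_like(tokens[i]) and not is_name_like(nxt)


def extract(tokens: Tokens, window: int = 5,
            use_level2: bool = True) -> List[Tuple[int, int]]:
    """Return (start, end) token indices of extracted fragments."""
    n = len(tokens)
    # Level 1 classifies every position in the document.
    starts = {i for i in range(n) if l1_start(tokens, i)}
    ends = {i for i in range(n) if l1_end(tokens, i)}
    if use_level2:
        # Level 2 looks only near level-1 hits: downstream of each
        # start for a missed end, upstream of each end for a missed start.
        l2_ends = {j for s in starts
                   for j in range(s, min(n, s + window))
                   if l2_end(tokens, j)}
        l2_starts = {j for e in ends
                     for j in range(max(0, e - window + 1), e + 1)
                     if l2_start(tokens, j)}
        starts, ends = starts | l2_starts, ends | l2_ends
    # Assumed pairing rule: match each start with the nearest end
    # at or after it, within the window.
    fragments = []
    for s in sorted(starts):
        candidates = [e for e in sorted(ends) if s <= e < s + window]
        if candidates:
            fragments.append((s, candidates[0]))
    return fragments


tokens = "Speaker : Dr. Jane Smith , UCD".split()
level1_only = extract(tokens, use_level2=False)  # [] (no end found)
two_level = extract(tokens)                      # [(2, 4)]
print([" ".join(tokens[s:e + 1]) for s, e in two_level])
# prints ['Dr. Jane Smith']
```

In this toy example the strict level-1 end rule finds nothing, so level 1 alone extracts no fragment. The loose level-2 end rule would also fire on "Speaker" if applied to every position, but because it is only consulted downstream of a level-1 start, it recovers the missing end without introducing that false positive.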