Finn, A. & Kushmerick, N. (2003). Active learning strategies for information extraction. Poster submission (rejected) to the International Joint Conference on Artificial Intelligence (Acapulco).
Abstract
Information Extraction (IE) is the process of identifying a set of pre-defined relevant items in text documents. For example, an IE system might convert free-text resumes into a structured form. Numerous machine learning algorithms have been developed that promise to eliminate the need for hand-crafted extraction rules. Instead, users are asked to annotate a set of training documents selected from a large collection of unlabelled documents. From these annotated documents, an IE learning algorithm generalizes a set of rules that can be used to extract items from unseen documents.
It is infeasible for users to annotate large numbers of documents. IE researchers have therefore investigated Active Learning (AL) techniques to automatically identify documents for the user to annotate [Thompson et al. 1999, Scheffer and Wrobel 2001, Ciravegna et al. 2002].
The essence of AL is a strategy for selecting the next document to be presented to the user for annotation. The selected documents should be those that will maximise the future performance of the learned extraction rules. Document selection algorithms attempt to find regions of the instance space that have not yet been sampled in order to select the most informative example for human annotation.
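As an illustration of selecting from unsampled regions of the instance space, the following minimal sketch picks the unlabelled document that is least similar to anything already annotated. It is not the selection algorithm studied in this work: the bag-of-words representation, the Jaccard similarity, and the names `tokens`, `jaccard`, and `select_next` are all assumptions made for the example.

```python
def tokens(doc: str) -> set:
    """Represent a document as its set of lower-cased whitespace tokens (an assumed representation)."""
    return set(doc.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_next(labelled: list, unlabelled: list) -> int:
    """Return the index of the unlabelled document whose closest labelled
    document is least similar, i.e. the one lying in the least-sampled region."""
    labelled_tokens = [tokens(d) for d in labelled]

    def max_similarity(doc: str) -> float:
        t = tokens(doc)
        # With no labelled documents yet, every candidate is equally unexplored.
        return max((jaccard(t, lt) for lt in labelled_tokens), default=0.0)

    return min(range(len(unlabelled)), key=lambda i: max_similarity(unlabelled[i]))


labelled = ["john smith software engineer"]
unlabelled = ["jane smith software engineer", "seminar on monday at noon"]
next_index = select_next(labelled, unlabelled)
```

Here the second unlabelled document shares no tokens with the annotated one, so it is the one chosen for annotation.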
Several selection strategies have been studied in the more general context of machine learning. For example, confidence-based approaches select for annotation the unlabelled instance about which the learner is least confident. While such techniques are clearly applicable to IE, we focus on novel selection algorithms that exploit the fact that the training data in question is text.
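The confidence-based baseline mentioned above can be sketched as follows. This is a generic illustration, not the novel text-specific strategies the abstract refers to; the `confidence` callable stands in for whatever score a learner assigns to its own extractions, and the names and toy scores are hypothetical.

```python
from typing import Callable, Sequence


def least_confident(pool: Sequence[str], confidence: Callable[[str], float]) -> int:
    """Return the index of the unlabelled document the learner is least confident about."""
    if not pool:
        raise ValueError("pool of unlabelled documents is empty")
    return min(range(len(pool)), key=lambda i: confidence(pool[i]))


# Toy confidence scores standing in for a real learner (assumed values).
scores = {"doc-a": 0.9, "doc-b": 0.4, "doc-c": 0.7}
pool = list(scores)
chosen = pool[least_confident(pool, scores.get)]
```

In an active learning loop, the chosen document would be annotated by the user, moved to the training set, and the learner retrained before the next selection.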