Finn, A. (2002). Machine learning for genre classification. MSc thesis (University College Dublin).
Abstract
Current Information Retrieval (IR) techniques succeed in identifying relevant documents. However, there may be a large number of relevant documents, and it is difficult to isolate those that most closely match a particular user's information needs. Genre analysis provides a complementary technique that can be used to improve the results achieved using standard IR techniques. A genre class is a group of documents that are written in a similar style. Genre classification can be used to identify documents that are written in the style most likely to satisfy a user's information need.
We consider the use of Machine Learning techniques applied to the task of automatic genre classification. We investigate two sample genre classification tasks: whether a news article is subjective or objective, and whether a review is positive or negative. We investigate the use of three different feature sets for building genre classifiers.
We argue that traditional methods of evaluating text classifiers are insufficient for genre classifiers, and we emphasize domain transfer as an evaluation criterion for the generated classifiers. Domain transfer refers to the ability of a genre classifier to classify documents that are about topics other than those it was trained on.
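As an illustration of what a domain-transfer evaluation might look like, the following minimal Python sketch trains a classifier on one topic domain and measures its accuracy on every domain, including the one it was trained on. It is not the implementation used in the thesis: the bag-of-words Naive Bayes classifier, the function names (`train_nb`, `predict_nb`, `transfer_matrix`), and the tiny `toy_corpora` data set are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Train a multinomial Naive Bayes model on (text, label) pairs.

    Assumption: plain bag-of-words features, chosen only for illustration.
    """
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_nb(model, text):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # words unseen in training are ignored
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of (text, label) pairs that the model labels correctly."""
    return sum(predict_nb(model, text) == label for text, label in docs) / len(docs)


def transfer_matrix(corpora):
    """Map (train_domain, test_domain) to accuracy on held-out test documents.

    `corpora` maps a domain name to a (train_docs, test_docs) pair.
    Entries with equal domains measure single-domain performance;
    the other entries measure domain transfer.
    """
    results = {}
    for train_domain, (train_docs, _) in corpora.items():
        model = train_nb(train_docs)
        for test_domain, (_, test_docs) in corpora.items():
            results[(train_domain, test_domain)] = accuracy(model, test_docs)
    return results


# Hypothetical toy data, invented for this sketch (not from the thesis).
toy_corpora = {
    "movies": (
        [("a wonderful moving film", "pos"), ("a dull tedious film", "neg")],
        [("wonderful acting", "pos"), ("tedious plot", "neg")],
    ),
    "restaurants": (
        [("delicious food and friendly staff", "pos"),
         ("bland food and rude staff", "neg")],
        [("friendly service", "pos"), ("rude waiter", "neg")],
    ),
}

results = transfer_matrix(toy_corpora)
for (train_domain, test_domain), acc in sorted(results.items()):
    print(f"train={train_domain:12s} test={test_domain:12s} accuracy={acc:.2f}")
```

Reporting the full matrix, rather than a single accuracy figure, separates single-domain performance (same training and test domain) from transfer performance (different domains). With topic-specific word features such as these, a classifier has little to go on once the vocabulary changes, which is one reason the choice of feature set matters for transfer.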
Our experiments show that genre classification using Machine Learning techniques is challenging, but feasible. For both sample genre classification tasks, we build classifiers that perform well within a single topic domain. We find that it is difficult to build genre classifiers that transfer well to other domains.