BTE is a Python module for automated extraction of body text from web pages. It can also be used to generate short teasers/summaries.
BTE extracts the main body of text from a web page. It does this by tokenising the HTML document and performing some shallow processing: the document is represented as a binary string in which a 0 represents a tag token and a 1 represents a text token.
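To make the binary representation concrete, here is a minimal sketch of such a tokeniser built on Python 3's standard `html.parser`. The class `BitTokenizer` and the function `tokenise` are hypothetical names, not part of BTE, and the sketch assumes that every start or end tag is one tag token and every whitespace-separated word is one text token; BTE's own tokeniser may differ.

```python
from html.parser import HTMLParser


class BitTokenizer(HTMLParser):
    """Hypothetical tokeniser: records 0 for each tag and 1 for each word."""

    def __init__(self):
        super().__init__()
        self.bits = []    # 0 = tag token, 1 = text token
        self.tokens = []  # the tag or word behind each bit

    def handle_starttag(self, tag, attrs):
        self.bits.append(0)
        self.tokens.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.bits.append(0)
        self.tokens.append("</%s>" % tag)

    def handle_data(self, data):
        # Assumption: each whitespace-separated word is one text token.
        for word in data.split():
            self.bits.append(1)
            self.tokens.append(word)


def tokenise(html):
    """Return (bits, tokens) for an HTML string."""
    parser = BitTokenizer()
    parser.feed(html)
    parser.close()
    return parser.bits, parser.tokens


bits, tokens = tokenise("<html><body><p>Hello world</p></body></html>")
print("".join(str(b) for b in bits))  # prints 00011000
```

Keeping the original tokens alongside the bits means the text between any two positions can be recovered later.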
If we plot the cumulative total of tokens on the x-axis and the cumulative total of tag tokens on the y-axis, we get a curve that climbs steeply where tags dominate and flattens into a plateau over the body text.
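The points of that graph can be computed directly from the binary string. The following is a small sketch, with `tag_curve` as a hypothetical helper name; it assumes the encoding described earlier (0 for a tag token, 1 for a text token).

```python
from itertools import accumulate


def tag_curve(bits):
    """Return (x, y) points: x = tokens seen so far, y = tag tokens seen so far."""
    # A tag token is encoded as 0, so 1 - b equals 1 exactly for tags.
    cumulative_tags = accumulate(1 - b for b in bits)
    return list(enumerate(cumulative_tags, start=1))


points = tag_curve([0, 0, 0, 1, 1, 1, 1, 0, 0])
for x, y in points:
    print(x, y)
```

In this toy input the y value stays at 3 for x from 3 to 7, which is the flat stretch corresponding to the run of text tokens.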
BTE works by finding two token positions i and j that maximise the number of text tokens between i and j together with the number of tag tokens before i and after j.
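The search for i and j can be sketched as a brute-force scan over all spans, scoring each one as (tag tokens outside the span) + (text tokens inside it). This is an illustration of the objective only, not necessarily how BTE implements it; `find_body_span` is a hypothetical name, and i and j are taken to be inclusive 0-based token indices.

```python
from itertools import accumulate


def find_body_span(bits):
    """Return (i, j), inclusive, maximising tags outside and text inside bits[i..j]."""
    n = len(bits)
    if n == 0:
        raise ValueError("bits must not be empty")
    # text_prefix[k] = number of text tokens (1s) in bits[:k]
    text_prefix = [0] + list(accumulate(bits))
    total_tags = n - text_prefix[n]

    best_score, best_span = -1, (0, 0)
    for i in range(n):
        for j in range(i, n):
            text_inside = text_prefix[j + 1] - text_prefix[i]
            tags_inside = (j + 1 - i) - text_inside
            score = (total_tags - tags_inside) + text_inside
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span


bits = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
print(find_body_span(bits))  # prints (3, 8)
```

The span (3, 8) keeps the single tag at position 5 because dropping it would also lose text on one side. The scan is quadratic in the number of tokens; since the score is a maximum-sum subarray problem over values of +1 for text and -1 for tags, a linear-time scan is also possible.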
BTE can also be used to generate summaries/teasers of news articles, since the start of the body text often contains a summary of the article.
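As an illustration of that idea, a teaser can be cut from the start of the extracted body text. This is a sketch only: `make_teaser` and its `max_words` parameter are hypothetical and do not describe how BTE builds its summaries.

```python
def make_teaser(body_text, max_words=30):
    """Hypothetical helper: return the first max_words words of body_text."""
    words = body_text.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so readers can see the text was truncated.
    return " ".join(words[:max_words]) + "..."


print(make_teaser("The quick brown fox jumps over the lazy dog", max_words=4))
```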
Just copy the Python file into a location on your Python path.
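import sys
import BodyTextExtractor

html = open(sys.argv[1]).read()  # path to the HTML file, given as the first argument
p = BodyTextExtractor.HtmlBodyTextExtractor()
p.feed(html)  # assumption: the extractor takes its input through an HTMLParser-style feed()
x = p.body_text()  # extracted body text
s = p.summary()    # short teaser/summary
t = p.full_text()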
This software is distributed under the GNU General Public License. Please read the file LICENCE.