Body Text Extraction

by Bill on May 6, 2010 under projects


A while back, when I was a Berkman intern, fellow Berktern Brian Young and I spent an afternoon modifying a Python script called BTE (Body Text Extraction). The script automatically pulls out the principal portion of text (the body) of an HTML document, and works by finding the portion of the document that has the highest ratio of text to tags. At the time I was interested in using BTE in a web application to do real-time body extraction, which meant I needed something fast. BTE wasn't quite fast enough, so Brian and I made it faster (for the nerdy among you, we improved BTE from O(n³) to O(n²)).
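For the curious, here is a minimal sketch of the idea in Python. It is not the actual BTE code: the regex tokenizer, the function names, and the exact scoring rule (count the tags outside a span plus the words inside it, and keep the span with the highest total) are my assumptions for illustration. Running counts make each candidate span cheap to score, so checking every span takes O(n²) time overall; recounting each span from scratch would take O(n³).

```python
import re
from itertools import accumulate

# Assumed tokenizer: a token is either a whole tag or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def extract_body(html):
    """Return the words of the span that is richest in text and poorest in tags."""
    tokens = TOKEN_RE.findall(html)
    is_tag = [t.startswith("<") for t in tokens]
    n = len(tokens)

    # Running counts: tags_before[k] is the number of tags in tokens[:k],
    # words_before[k] is the number of words in tokens[:k].
    tags_before = [0] + list(accumulate(int(b) for b in is_tag))
    words_before = [0] + list(accumulate(int(not b) for b in is_tag))
    total_tags = tags_before[n]

    best_score, best_i, best_j = -1, 0, 0
    for i in range(n + 1):
        for j in range(i, n + 1):
            # Score = tags outside tokens[i:j] + words inside tokens[i:j].
            score = (
                tags_before[i]
                + (words_before[j] - words_before[i])
                + (total_tags - tags_before[j])
            )
            if score > best_score:
                best_score, best_i, best_j = score, i, j

    # Drop any tags left inside the winning span and join the words.
    return " ".join(t for t in tokens[best_i:best_j] if not t.startswith("<"))


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title></head><body>"
        "<div><a>Home</a></div>"
        "<p>This is the main body text of the page</p>"
        "<div><a>About</a></div>"
        "</body></html>"
    )
    print(extract_body(page))
```

The navigation links lose out because reaching them means swallowing several tags for a single word, while the paragraph is a long run of words with no tags at all.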

So why am I posting this now? Well, Brian and I contacted BTE's author, Aidan Finn, regarding the changes, and Aidan has recently incorporated them into the official code.

Big thanks go to Brian. I couldn't have done the coding myself (I didn't and still don't know Python), and while in the end we weren't sure who did what, I'm sure Brian's 1337 computer hacking skills were far greater than mine.

More information on BTE can be found here, and the code can be found at GitHub.

