Archive for the ‘Projects’ Category

Body Text Extraction

Thursday, May 6th, 2010

A while back, when I was a Berkman intern, fellow Berktern Brian Young and I spent an afternoon modifying a Python script called BTE (Body Text Extraction). The script is an automated way to pull out the principal portion of text (the body) of an HTML document, and works by finding the portion of the document that has the highest ratio of text to tags. At the time I was interested in using BTE in a web application to do real-time body extraction, which meant I needed something that was fast. BTE wasn’t quite fast enough, so Brian and I made it faster (for the nerdy among you, we improved BTE from O(n³) to O(n²)).
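For the curious, the general idea can be sketched in a few lines of Python. This is not the BTE source and not necessarily the change that was made to it; it is a minimal illustration under some assumptions. The tokenizer, the function names, and the scoring rule (words inside the chosen span plus tags outside it, used here as a simple stand-in for "highest ratio of text to tags") are all hypothetical. Re-counting tags and words for every candidate span would cost O(n³) in the number of tokens; keeping a running count of tags lets each span be scored in constant time, which brings the search down to O(n²).

```python
import re

# Hypothetical tokenizer: a token is either a whole tag or a single word.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def extract_body(html):
    """Return the words of the span that scores best as 'mostly text'.

    Score of a span = words inside the span + tags outside the span.
    This scoring rule is an assumption made for illustration.
    """
    tokens = tokenize(html)
    n = len(tokens)

    # tags_before[k] = number of tag tokens among tokens[:k]
    tags_before = [0] * (n + 1)
    for k, token in enumerate(tokens):
        tags_before[k + 1] = tags_before[k] + (1 if token.startswith("<") else 0)
    total_tags = tags_before[n]

    best_score, best_span = -1, (0, 0)
    for i in range(n + 1):            # O(n) start positions
        for j in range(i, n + 1):     # O(n) end positions
            tags_inside = tags_before[j] - tags_before[i]  # O(1) lookup
            words_inside = (j - i) - tags_inside
            score = words_inside + (total_tags - tags_inside)
            if score > best_score:
                best_score, best_span = score, (i, j)

    i, j = best_span
    return " ".join(t for t in tokens[i:j] if not t.startswith("<"))
```

Given a page with a navigation link, one paragraph of prose, and a footer link, the sketch keeps only the paragraph, because pulling in either link would cost more tags than it gains words.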

So why am I posting this now? Well, Brian and I contacted BTE’s author, Aidan Finn, regarding the changes, and Aidan has recently incorporated our changes into the official code.

Big thanks go to Brian. I couldn’t have done the coding myself (I didn’t and still don’t know Python), and while at the end we weren’t sure who did what, I’m sure Brian’s 1337 computer hacking skills were far greater than mine.

More information on BTE can be found here, and the code can be found at GitHub.

The First Month

Monday, June 29th, 2009

It’s the end of June, which means it is the end of my first month of working on TermsWatch. So what has happened in this first month?

Enter the EFF

The first few days of the month were spent settling in at Berkman and getting my new(ish) laptop ready for work. By Thursday I was ready to get down to business. Well, it just so happened that Thursday was launch day for the EFF’s latest project, TOSBack. I spent Thursday afternoon playing around with TOSBack and finding out as much as I could about it. I then spent the next few days running around like a chicken with its head cut off; TOSBack does half of what I was going to do. I thought to myself, “Well, what do I do now?”

Fortunately there are cooler heads than I at Berkman, and they decided it was best to give the EFF a call. We had a few phone calls with Tim Jones, Activism & Technology Manager at the EFF and the man behind TOSBack. Turns out Berkman and the EFF have a lot of similar hopes and dreams regarding a service such as TOSBack. It also turns out that Tim was about to go on vacation for the summer, so nobody was going to be working on the project for a couple of months. All in all, this turned out to be a great opportunity for everybody involved and it was agreed that I would spend the summer working on the TOSBack code.

Symfony & Text Extraction

With the TOSBack code in hand I went to work. The first order of business was to port TOSBack over to Symfony, a web application framework. A framework such as Symfony has several advantages, including taking care of some tedious aspects of creating an application, such as checking input for security issues and generating administration pages. All in all this was a fairly painless process.

The latest, and current, problem that I have been tackling is how to extract the important information from a web page. Fortunately, as is becoming a common occurrence this summer, it turns out there are quite a few bright people in our small area who have done work like this, and they are all easy to talk to. I’ve spent nearly a week talking to these bright people, gaining insight into various approaches and understanding exactly what it is I need to do. I have a pretty good idea what it is I am going to do now (for those interested in the technical stuff, check out this summary of the extractor), and with a new week on the horizon, I hope to get this thing working quickly.

What Is Celebrity Bar Fight?

Monday, May 25th, 2009

Celebrity Bar Fight is a website (located at www.celebritybarfight.com) that lets you decide who would win in a drunken bar fight between two celebrities. The site is something that my friend David and I work on in our spare time; we use it to learn, to amuse, and to have fun.

The site was inspired by a game David and I would play at the bar. We would name two celebrities, then argue about who would win in a no-rules bar fight (assuming both celebrities were in their peak condition). This would usually entertain us for a while, and we often ended up getting others at the bar to play along. And for those who are wondering, yes, this game was itself inspired by the conversation in Fight Club in which Edward Norton and Brad Pitt talk about which celebrity they most wanted to fight.

Interning at Harvard

Tuesday, May 19th, 2009

Yup, via a chain of events I have landed an internship at The Berkman Center, a center within the Harvard Law School. Starting June 1st I will be developing TermsWatch, a web service that will provide notification of updates to, and plain English explanations of, those Terms of Use and Terms of Service agreements (Terms) that every website and piece of software makes you consent to.

The whole thing started back in February when Facebook updated its Terms of Use. The update occurred on February 4th, but nobody noticed the changes until the 15th (keep in mind that Facebook has around 175 million active users).

Furthermore, Facebook’s Terms of Use included an implied consent clause regarding changes. As many as 175 million users consented to the February 4th changes completely unaware that they were consenting to anything or that any change had occurred. This lack of notice presented an obvious problem, so I began to think about a program that would monitor Facebook’s Terms of Use and alert individuals when a change was detected.

Before I could begin working on the program, Facebook implemented its new, democratic process for updating its Terms (now called the Statement of Rights and Responsibilities). Satisfied that notice was now being given, I joined many other Facebook users in commenting on the proposed Rights and Responsibilities. While it was great that users were given a voice in the process, it also became clear that most of us (myself included) have no idea what a lot of the language means in the Statement of Rights and Responsibilities, or why it needs to be included. One particular clause angered and disturbed a lot of users by allowing Facebook the right to transfer and sublicense its ability to reproduce and modify users’ content. Fortunately, a number of individuals were able to explain why such a clause is required (so Facebook can allow third-party applications to access and use its users’ data). Still, it became obvious that the dense legalese of the Statement of Rights and Responsibilities is too difficult for most (aside from experts and professionals in law) to read and understand.

It was about this time that Google started the application process for its Summer of Code. At first I scanned through the list of project ideas related to technology and society (my emerging area of interest), but after a while I thought it would be neat to work on a generalized version of the program I had thought of in February that would monitor a site’s (or software provider’s) Terms for changes. I also thought about the difficulty of reading Terms and decided that the program would be much more useful if it included a way for legal experts to attach plain English explanations to the Terms. With all this in mind I wrote an application for the Summer of Code.

All Summer of Code programs need to be written for one of the available mentoring organizations. Since this program appeared to be a perfect fit for it, I applied with the Berkman Center as my mentor. Since I was part of last year’s Summer of Code, I figured I would easily be part of this year’s, so I sat back and waited for the good word. Less than four days after the submission deadline I got the not-so-good word that my application was considered ineligible. But no sooner had I found out about the ineligibility than I received an email from the Berkman Center; they loved the idea and they asked me to apply to their internship program and spend the summer in Cambridge. Well, I couldn’t say no to that, so I applied. The process went smoothly, I was accepted, and now I am starting to pack, because I have to move in less than two weeks.

Description of TermsWatch

Tuesday, May 19th, 2009

TermsWatch is an attempt to combine the power of knowledge and software to provide a tool that end users can use while reading and executing the Terms of Use or Terms of Service (Terms) that are typically encountered while using software or websites. Specifically, TermsWatch will aim to tackle two problems:

  • Service Providers are Allowed to Change the Terms Without Notice: The majority of Terms include a clause that allows the service provider to make changes to the Terms without notifying its users. Further, this clause usually includes wording stating that a user’s continued use of the provided service will indicate an acceptance of the new or altered Terms.
  • Most Users Cannot Understand the Terms: By their nature, Terms contain lengthy, dense language that is often very difficult for common users to understand. Consequently, few users will read the Terms that they are asked to consent to. Fewer still will read the Terms and understand what they are consenting to, what rights they are giving to the service provider, and what the nature of their relationship with the service provider is.

In order to combat these issues, TermsWatch is designed to do the following:

  • Monitor Terms of Use: The service will maintain a list of URLs, each referring to the web page that displays the Terms of a service provider. Periodically, TermsWatch will download these web pages and use XML parsing to extract the text of the Terms. This extracted text will then be compared to the most recent version of the Terms that is available in TermsWatch’s database. If the two versions differ, the new Terms will be saved to the database and an RSS feed, specific to the service provider, will be updated in order to alert users to any changes in the Terms.
  • Store Annotations: In addition to storing the Terms of each service provider being monitored, TermsWatch will also store annotations made to the stored Terms. This would include storing explanations and comments made about a specific version of a provider’s Terms, as well as storing explanations/comments about changes that have occurred between versions.
  • Expose Data Via Public API: To truly be useful, the data collected from monitoring Terms and storing annotations must be made accessible by other programs and services. To make this data accessible, TermsWatch will provide a public API that will allow for retrieval of data about a specific service provider’s Terms, such as the text of the Terms or annotations made to the Terms. The API will also allow for the creation and editing of annotations.
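The comparison step in the monitoring item above could look something like the following Python sketch. It is only an illustration, not TermsWatch code: the function names and the whitespace normalization are assumptions, and downloading the page, extracting the text, saving to the database, and updating the RSS feed are all left out. It uses Python's standard difflib module to produce a unified diff between the stored version and the freshly fetched one.

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace on each line and drop blank lines.

    Assumption: purely cosmetic whitespace changes should not count
    as a change to the Terms.
    """
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def detect_change(stored_text, fetched_text):
    """Return unified-diff lines if the Terms changed, else an empty list."""
    old = normalize(stored_text)
    new = normalize(fetched_text)
    if old == new:
        return []
    return list(
        difflib.unified_diff(
            old, new, fromfile="stored", tofile="fetched", lineterm=""
        )
    )
```

A caller would treat a non-empty result as the signal to save the new version and add an entry to that provider's RSS feed; the diff lines themselves could also serve as the thing annotations about changes between versions are attached to.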