Building a Real-Time Web Library for Processing

In the course of working on my big screens project this semester — a real-time web clock — I’ve realized that it would be awfully nice to have a Processing library to abstract away all of the footwork involved in skimming the latest data from the usual Web 2.0 suspects like Twitter, Flickr, Facebook, Foursquare, etc.

There’s already some nice work on this front.

Mark McBride’s excellent Tweet Stream library for Processing works exactly as advertised. There are also Java wrappers for popular APIs (flickrj comes to mind) that should work in Processing and are likely to make data extraction easier.

But many of these approaches are lacking, and getting from a raw API to a real-time stream can be a lot of work. So I’ve started on a Processing library whose sole function is to provide a real-time stream of web data from as many sources as possible, with as little delay as possible.

A few considerations and concerns for the design and implementation of the library are outlined below.

One Library vs. Many

Traditionally, API wrappers and libraries come à la carte. You want Flickr? Download the Flickr library. You want Twitter? Download the Twitter library. Etc.

That’s nice, but all of these libraries work in slightly different ways and putting two or more services together requires plenty of fuss and glue code. I think a lot of the interesting things you can do with real-time data involve placing it in the context of other events, which favors a one-library approach.

Also, consolidating the services into a single library means that we can use polymorphism to make dealing with generic “event” objects from different sources relatively painless. E.g., if you wanted to work with tweets and Flickr updates in similar contexts, you should be able to manage everything from a single array list since their event objects inherit from some kind of generic event class.
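To make the polymorphism idea concrete, here is a minimal sketch in plain Java. The subclass names (TweetUpdate, FlickrUpdate) and the photoUrl field are placeholders I made up for illustration; only the getter names match the usage sketch at the end of this post, and none of this is final.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: every service-specific event extends it.
abstract class Update {
  private final String author;
  private final String text;
  private final long time; // milliseconds since the Unix epoch

  Update(String author, String text, long time) {
    this.author = author;
    this.text = text;
    this.time = time;
  }

  String getAuthor() { return author; }
  String getText() { return text; }
  long getTime() { return time; }

  // Each subclass reports which service it came from.
  abstract String getServiceName();
}

// Hypothetical subclass for tweets.
class TweetUpdate extends Update {
  TweetUpdate(String author, String text, long time) {
    super(author, text, time);
  }

  @Override
  String getServiceName() { return "Twitter"; }
}

// Hypothetical subclass for Flickr uploads, with one service-specific field.
class FlickrUpdate extends Update {
  private final String photoUrl;

  FlickrUpdate(String author, String title, long time, String photoUrl) {
    super(author, title, time);
    this.photoUrl = photoUrl;
  }

  String getPhotoUrl() { return photoUrl; }

  @Override
  String getServiceName() { return "Flickr"; }
}

class UpdateDemo {
  // Builds a one-line summary without caring which service produced the update.
  static String summarize(Update u) {
    return u.getServiceName() + ": " + u.getText() + " (" + u.getAuthor() + ")";
  }

  public static void main(String[] args) {
    // One list holds events from every service.
    List<Update> updates = new ArrayList<>();
    updates.add(new TweetUpdate("alice", "hello world", 1000L));
    updates.add(new FlickrUpdate("bob", "sunset", 2000L, "http://example.com/sunset.jpg"));
    for (Update u : updates) {
      System.out.println(summarize(u));
    }
  }
}
```

Code that only needs the common fields never has to know which service an update came from, while service-specific extras (like a photo URL) stay available to anyone willing to cast.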

Maintainability

Not all services with high rates of incoming data have clean, public-facing APIs for grabbing said data.

In the case of Foursquare, for example, there’s no way to access the most recent public check-ins through their official API. The API allows certain kinds of interactions with the service, but it doesn’t do exactly what we want.

Likewise, Facebook’s status API doesn’t let you do an empty search for all of the latest updates — instead you’re limited to a specific search term. So, in this case, there’s an API that almost does what we want, but we’ll have to get clever if we want something resembling the whole stream of public status updates.

Therefore, getting real-time data from these services will involve some hairy HTTP polling and HTML scraping. These kinds of things are liable to break if / when the structure of the source page changes. There are also potential request rate limits to deal with. Keeping the library up to date and fixing breakage when it happens is going to be a challenge, but I can’t think of a way around this until more sites support officially sanctioned access to structured real-time data through an API. (And good on Flickr and Twitter for providing real-time data in their APIs already.)
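Whatever the scraping ends up looking like, the polling loop around it will probably need two things: a minimum interval between requests, and a way to skip items that were already on the page last time. Here is a rough sketch of just that part. The DedupPoller class is hypothetical, and the fetcher argument stands in for the real HTTP request and HTML parsing, which I've left out entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper: polls a source no more often than minIntervalMs and
// reports only the item IDs it has not seen before.
class DedupPoller {
  private final Supplier<List<String>> fetcher; // stands in for HTTP request + scrape
  private final long minIntervalMs;
  private final int maxRemembered; // should be at least as large as one page of results
  private final Set<String> seen = new LinkedHashSet<>(); // keeps insertion order
  private long lastPollMs = 0;
  private boolean hasPolled = false;

  DedupPoller(Supplier<List<String>> fetcher, long minIntervalMs, int maxRemembered) {
    this.fetcher = fetcher;
    this.minIntervalMs = minIntervalMs;
    this.maxRemembered = maxRemembered;
  }

  // Returns the new IDs, or an empty list if called too soon after the last poll.
  List<String> poll(long nowMs) {
    List<String> fresh = new ArrayList<>();
    if (hasPolled && nowMs - lastPollMs < minIntervalMs) {
      return fresh; // respect the request rate limit; the fetcher is not called
    }
    hasPolled = true;
    lastPollMs = nowMs;
    for (String id : fetcher.get()) {
      if (seen.add(id)) { // add() returns false for IDs we already know
        fresh.add(id);
      }
    }
    // Forget the oldest IDs so memory use stays bounded.
    Iterator<String> it = seen.iterator();
    while (seen.size() > maxRemembered && it.hasNext()) {
      it.next();
      it.remove();
    }
    return fresh;
  }
}
```

Passing the clock in as an argument keeps the sketch easy to test; in a real sketch the caller would hand it something like System.currentTimeMillis().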

Actual Rate vs. Available Rate

Ideally, every single event from a service would be captured by the real-time web library. However, for extremely high-volume services (Twitter, for example), even the official APIs only give a fraction of the actual event streams. Again, there’s not really a way around this, but keeping tabs on what percentage of the full stream we’re currently reading might be useful — some way to compare the current message rate through the library to the actual message rate on the service’s end. (For example, it would be useful to know that the library’s current Twitter message rate is 30% of actual traffic.)
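The arithmetic here is simple: coverage is the observed rate divided by the actual rate. A minimal sketch follows, with the caveat that CoverageMeter is a made-up name and that the actual rate has to come from somewhere outside the library (published statistics, say), which is the hard part and is just a parameter here.

```java
// Hypothetical helper: tracks how many events the library has received and
// compares that rate against an externally supplied "actual" rate.
class CoverageMeter {
  private final long startMs;
  private long received = 0;

  CoverageMeter(long startMs) {
    this.startMs = startMs;
  }

  // Call once for every event the library receives from the service.
  void recordEvent() {
    received++;
  }

  // Events per second seen by the library since startMs.
  double observedRate(long nowMs) {
    long elapsedMs = nowMs - startMs;
    if (elapsedMs <= 0) {
      return 0.0;
    }
    return received / (elapsedMs / 1000.0);
  }

  // Fraction of the service's real traffic that the library is seeing.
  // actualRatePerSecond is an assumption supplied by the caller.
  double coverage(long nowMs, double actualRatePerSecond) {
    if (actualRatePerSecond <= 0) {
      return 0.0;
    }
    return observedRate(nowMs) / actualRatePerSecond;
  }
}
```

For instance, 30 events received over 10 seconds is 3 events per second; against an actual rate of 10 per second, that works out to a coverage of 0.3, i.e. the 30% figure above.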

Conversely, being able to dynamically specify a rate limit for each service might be useful in certain contexts where bandwidth needs to be conserved or services need to be synchronized. (At the very least, rate limiting will be useful for my big screens project, where a full-rate sample of the services would result in graphic slowdowns.)
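One simple way to do per-service rate limiting is to enforce a minimum gap between delivered events and drop whatever arrives in between. The sketch below assumes that dropping is acceptable (queueing would be the alternative); ServiceThrottle and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: caps how often events from each service are delivered.
class ServiceThrottle {
  private final Map<String, Long> minIntervalMs = new HashMap<>();
  private final Map<String, Long> lastDeliveredMs = new HashMap<>();

  // Can be called at any time to set or change a service's limit.
  void setMaxRate(String service, double eventsPerSecond) {
    if (eventsPerSecond <= 0) {
      throw new IllegalArgumentException("eventsPerSecond must be positive");
    }
    minIntervalMs.put(service, (long) (1000.0 / eventsPerSecond));
  }

  // Returns true if an event from this service may be delivered now.
  // Services with no configured limit are always allowed.
  boolean allow(String service, long nowMs) {
    Long interval = minIntervalMs.get(service);
    Long last = lastDeliveredMs.get(service);
    if (interval != null && last != null && nowMs - last < interval) {
      return false; // too soon since the last delivered event: drop this one
    }
    lastDeliveredMs.put(service, nowMs);
    return true;
  }
}
```

Dropping keeps the stream current at the cost of completeness, which seems like the right trade for a display that only cares about what is happening right now.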

So how should it work from the perspective of a programmer using the library? Maybe something like this:

WebStream webStream;

void setup() {
  // instantiate the web stream object
  webStream = new WebStream(this);

  // add as many (or as few) real-time services as you like
  webStream.activateFlickr("api key here");
  webStream.activateFacebook();
  webStream.activateTwitter("api key here");
}

void draw() {
  // Nothing to see here, yet
}

void streamEvent(Update u) {
  // the web stream returns updates via a callback event
  // this would print a quick summary of the most recent event
  println("content: " + u.getText() +
    " via: " + u.getAuthor() +
    " at: " + u.getTime() +
    " from: " + u.getServiceName());
}
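On the library side, a callback like streamEvent() could be wired up with Java reflection, which is a common approach in Processing libraries: look up the method by name on the parent sketch and invoke it whenever an event arrives. A rough sketch in plain Java follows. CallbackDispatcher and FakeSketch are hypothetical names, and the stand-in handler takes a String instead of an Update only to keep the example self-contained.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical helper: finds an optional public callback on the parent object
// and invokes it for each event.
class CallbackDispatcher {
  private final Object parent;
  private Method callback; // stays null if the parent does not define the method

  CallbackDispatcher(Object parent, String methodName, Class<?> eventType) {
    this.parent = parent;
    try {
      callback = parent.getClass().getMethod(methodName, eventType);
    } catch (NoSuchMethodException e) {
      callback = null; // the callback is optional, so this is not an error
    }
  }

  // Delivers one event; returns false if there is no usable callback.
  boolean fire(Object event) {
    if (callback == null) {
      return false;
    }
    try {
      callback.invoke(parent, event);
      return true;
    } catch (IllegalAccessException | InvocationTargetException e) {
      callback = null; // stop calling a handler that keeps failing
      return false;
    }
  }
}

// Stand-in for a sketch that defines the callback.
class FakeSketch {
  String lastEvent = null;

  public void streamEvent(String text) {
    lastEvent = text;
  }
}
```

A sketch that never defines the callback just doesn't get called, which fits the "add as many or as few services as you like" spirit of the API above.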

That’s the plan, at least. I’ll work on this over the next week and eventually have some code for public consumption in the next month or two. If anyone has feature requests or a different take on how something like this should work, let me know.