[Image: a digital scan of a 35mm film image of a Processing sketch running on an LCD]

Building a Real-Time Web Library for Processing

In the course of working on my big screens project this semester — a real-time web clock — I’ve realized that it would be awfully nice to have a Processing library to abstract away all of the footwork involved in skimming the latest data from the usual Web 2.0 suspects like Twitter, Flickr, Facebook, Foursquare, etc.

There’s already some nice work on this front. Mark McBride’s excellent Tweet Stream library for Processing works exactly as advertised. There are also Java wrappers for popular APIs (flickrj comes to mind) that should work in Processing and are likely to make data extraction easier.

But many of these approaches fall short, and getting from a raw API to a real-time stream can be a lot of work. So I’ve started work on a Processing library whose sole function is to provide a real-time stream of web data from as many sources as possible, with as little delay as possible.

A few considerations and concerns for the design and implementation of the library are outlined below.

One Library vs. Many
Traditionally, API wrappers / libraries come à la carte. You want Flickr? Download the Flickr library. You want Twitter? Download the Twitter library. Etc.

That’s nice, but all of these libraries work in slightly different ways and putting two or more services together requires plenty of fuss and glue code. I think a lot of the interesting things you can do with real-time data involve placing it in the context of other events, which favors a one-library approach.

Also, consolidating the services into a single library means that we can use polymorphism to make dealing with generic “event” objects from different sources relatively painless. E.g., if you wanted to work with tweets and Flickr updates in similar contexts, you should be able to manage everything from a single array list since their event objects inherit from some kind of generic event class.
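That inheritance idea can be sketched in plain Java. The class names here (Update, Tweet, FlickrPhoto) and the getServiceName method are my own placeholders, not the library's final API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generic event class: every service-specific event
// inherits from Update, so mixed events can share one collection.
abstract class Update {
    final String text;
    final long time;

    Update(String text, long time) {
        this.text = text;
        this.time = time;
    }

    abstract String getServiceName();
}

class Tweet extends Update {
    Tweet(String text, long time) { super(text, time); }
    String getServiceName() { return "Twitter"; }
}

class FlickrPhoto extends Update {
    FlickrPhoto(String title, long time) { super(title, time); }
    String getServiceName() { return "Flickr"; }
}

public class StreamDemo {
    public static void main(String[] args) {
        // Events from different services live in a single list...
        List<Update> events = new ArrayList<>();
        events.add(new Tweet("hello", 1290049200L));
        events.add(new FlickrPhoto("sunset", 1290049260L));

        // ...and can be handled uniformly, regardless of source.
        for (Update u : events) {
            System.out.println(u.getServiceName() + ": " + u.text);
        }
    }
}
```

The payoff is in the loop at the end: code that consumes the stream never needs to know which service an event came from unless it asks.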

Maintainability
Not all services with high rates of incoming data have clean, public-facing APIs for grabbing said data.

In the case of Foursquare, for example, there’s no way to access the most recent public check-ins through their official API. The API allows certain kinds of interactions with the service, but it doesn’t do exactly what we want.

Likewise, Facebook’s status API doesn’t let you do an empty search for all of the latest updates — instead you’re limited to a specific search term. So, in this case, there’s an API that almost does what we want, but we’ll have to get clever if we want something resembling the whole stream of public status updates.

Therefore getting real-time data from these services will involve some hairy HTTP polling and HTML scraping. These kinds of things are liable to break if / when the structure of the source page changes. There are also potential request rate limits to deal with. Keeping the library up to date and fixing breakage when it happens is going to be a challenge — but I can’t think of a way around this until more sites support officially-sanctioned access to structured real-time data through an API. (And good on Flickr and Twitter for providing real-time data in their APIs already.)
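One recurring piece of that polling work is deduplication: successive polls return overlapping windows of items, so the library has to remember which IDs it has already emitted. A minimal sketch of that step, with hypothetical names:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Tracks which item IDs have been seen across polls, so overlapping
// poll results only surface each event once. Names are placeholders.
class PollDeduper {
    private final Set<String> seen = new HashSet<>();

    // Returns only the items from this poll that haven't appeared before.
    List<String> filterNew(List<String> polledIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : polledIds) {
            if (seen.add(id)) {  // add() returns false if already present
                fresh.add(id);
            }
        }
        return fresh;
    }
}

public class PollDemo {
    public static void main(String[] args) {
        PollDeduper d = new PollDeduper();
        System.out.println(d.filterNew(List.of("a", "b", "c")));  // [a, b, c]
        System.out.println(d.filterNew(List.of("b", "c", "d")));  // [d]
    }
}
```

In practice the set would need pruning (or a bounded LRU structure) so it doesn't grow without limit on a long-running sketch.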

Actual Rate vs. Available Rate
Ideally, every single event from a service would be captured by the real-time web library. However, for extremely high-volume services (Twitter, for example), even the official APIs only give a fraction of the actual event streams. Again, there’s not really a way around this, but keeping tabs on what percentage of the full stream we’re currently reading might be useful — some way to compare the current message rate through the library to the actual message rate on the service’s end. (For example, it would be useful to know that the library’s current Twitter message rate is 30% of actual traffic.)

Conversely, being able to dynamically specify a rate limit for each service might be useful in certain contexts where bandwidth needs to be conserved or services need to be synchronized. (At the very least, rate limiting will be useful for my big screens project, where a full-rate sample of the services would result in graphic slowdowns.)
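A per-service rate limit could be as simple as a fixed-window counter: pass events until the window's quota is used up, then drop the rest until the window rolls over. A rough sketch (class and method names are assumptions, not the library's API):

```java
// Hypothetical per-service rate limiter: allow at most maxPerWindow
// events per time window, drop everything beyond that.
class RateLimiter {
    private final int maxPerWindow;
    private final long windowMillis;
    private long windowStart;
    private int count;

    RateLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if an event arriving at time t (millis) should pass.
    synchronized boolean allow(long t) {
        if (t - windowStart >= windowMillis) {
            windowStart = t;  // start a fresh window
            count = 0;
        }
        return ++count <= maxPerWindow;
    }
}

public class RateDemo {
    public static void main(String[] args) {
        RateLimiter rl = new RateLimiter(2, 1000);  // 2 events per second
        System.out.println(rl.allow(0));     // true
        System.out.println(rl.allow(100));   // true
        System.out.println(rl.allow(200));   // false: window quota spent
        System.out.println(rl.allow(1200));  // true: new window
    }
}
```

Dropping events is fine for display purposes like the big screens project; a library might also want a queueing mode that delays events instead of discarding them.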


So how should it work from the library programmer’s perspective? Maybe something like this:

  WebStream webStream;

  void setup() {
    // instantiate the web stream object
    webStream = new WebStream(this);

    // add as many (or as few) real-time services as you like
    webStream.activateFlickr("api key here");
    webStream.activateFacebook();
    webStream.activateTwitter("api key here");
  }

  void draw() {
    // nothing to see here, yet
  }

  void streamEvent(Update u) {
    // the web stream returns updates via a callback event
    // this would print a quick summary of the most recent event
    println("content: " + u.getText() +
            " via: " + u.getAuthor() +
            " at: " + u.getTime() +
            " from: " + u.getServiceName());
  }

That’s the plan, at least. I’ll work on this over the next week and eventually have some code for public consumption in the next month or two. If anyone has feature requests or a different take on how something like this should work, let me know.

November 18 2010 at 3 AM

Awesome idea, Eric. And for me, great timing (:-))

For Rest of You, I'm building a time machine of sorts that captures all my web data in a SQL database and then loads it into Processing with a PHP script. Although this library wouldn't work for my current project (since I want to maintain my own database to get records further back than the API allows), I know it would be extremely popular with us ITP folks and others who want to integrate more real-time web data into our Processing sketches and work.

Keep us updated!

November 23 2010 at 3 AM

I thought about something similar last week. Not the real-time stuff, but the one-API-for-all approach. I came up with Yahoo's YQL. There are a bunch of APIs in there (like Twitter and all the Yahoo services like Flickr) and you can also add your own. You can query every website with simple XPath and/or JavaScript.

November 23 2010 at 5 PM

That's a great idea indeed, and something I've toyed with myself in PHP in the past. Though I'd highly recommend designing this from the start as a plug-and-play architecture, to make it more worthwhile for yourself and for others. By this I mean an API that allows web APIs beyond your own supported ones to be registered/integrated with your lib, e.g. through something like a generic WebstreamProvider interface, which all the supported service wrappers need to implement. Then you could also have a generic RSS/Atom stream provider, which would allow working with lots of non-API-based data streams without any extra work supporting them. Your setup code would then look like:

stream = new WebStream(this);
stream.register(new FlickrProvider("apikey"));
stream.register(new FacebookProvider("apikey"));
stream.register(new AtomProvider("http://...."));

I'd also recommend streamEvent to take a reference to the originating provider:

void streamEvent(Update u, WebstreamProvider prov) {
...
}

Good luck, and looking forward to seeing the first version!

November 24 2010 at 5 AM

robert:

I think you should check out XMPP in connection with Openfire and ActivityStreams for real-time processing of social media activities!

November 24 2010 at 6 AM
