Charles Babbage’s Brain, Photo: James Wheare

Concept

Can a machine accumulate enough information about your patterns of communication to create an effective digital doppelgänger?

Could we use the data left behind on Google’s servers and our own hard disks to effectively replace ourselves with an artificial intelligence born and bred of our online conversations and quirks? What might it be like to have a conversation with a past representation of ourselves? What would a hypothetical exchange between two digitally reconstructed individuals look like?

Michael Edgcumbe and I approached these questions with Caprica, our rough attempt to commit to code some of the ideas of digital reincarnation put forth in the (reportedly mediocre) eponymous television series.

Both Michael and I have managed to retain a good portion of our instant messenger chat logs. My archives represent just over a half-million lines of conversation logged from about 2001 to 2004. Michael’s are a bit more recent, and weigh in at 34,000 lines. So data is in relative abundance.

The goal was to build an autonomous chat bot that would draw from the content of our logs to construct an infinite stream of back-and-forth conversation between our younger selves. Ideally, these conversations should be reasonably cogent and reflect whatever personality / themes we left behind in our logs.

Technique

Our initial approach to an algorithm was simple — the entire chat log can be considered a kind of question / answer training set. There’s a bit of latent intelligence built right into the log, since it literally documents how you responded to a wide range of queries. By finding which line in the log is the closest match to a given query, we should be able to walk forward a few lines and retrieve a reasonable response. This turns the problem into one of sentence similarity and avoids the issue of extracting and classifying meaning from the logs.
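As a minimal sketch of the idea (the names and the similarity measure here are illustrative stand-ins, not the project’s actual code):

    import difflib

    def respond(query, log_lines, offset=2):
        """Find the log line closest to the query, then walk forward
        a few lines to grab what was said next as the response."""
        # difflib stands in for whatever similarity ranking is ultimately used
        matches = difflib.get_close_matches(query, log_lines, n=1, cutoff=0.0)
        if not matches:
            return None
        index = log_lines.index(matches[0])
        # assume the reply sits a couple of lines after the matched line
        return log_lines[min(index + offset, len(log_lines) - 1)]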

There are some peculiarities about instant messenger conversations which needed to be considered:

  • Typos are rampant
  • Netspeak doesn’t play well with NLTK dictionaries and algorithms trained on more formal corpora
  • A new line of conversation often acts as a comma; single-line responses and serial responses from one person are common

With these points in mind, we tried a number of techniques for ranking similarity between a query string and lines of logged conversation. First, we wanted to increase the opportunities for a match between the query and the log, so we used lemmatization / synonym lookup to expand the query.

For example, for the query “how about the weather” each word is expanded into a list of synonymous terms:

    [['how'],
     ['about', 'astir', 'approximately', 'close_to', 'just_about', 'some', 'roughly', 'more_or_less', 'around', 'or_so', 'almost', 'most', 'nearly', 'near', 'nigh', 'virtually', 'well-nigh'],
     ['the'],
     ['weather', 'weather_condition', 'conditions', 'atmospheric_condition', 'endure', 'brave', 'brave_out', 'upwind']]
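That expansion looks like WordNet output, so here is a sketch of how it might be produced with NLTK’s WordNet interface (an assumption on my part, not necessarily how caprica3-presented.py does it):

    from nltk.corpus import wordnet

    def expand_query(query):
        """Expand each word of the query into itself plus its WordNet lemma names."""
        expanded = []
        for word in query.lower().split():
            synonyms = [word]
            for synset in wordnet.synsets(word):
                for lemma in synset.lemma_names():
                    if lemma not in synonyms:
                        synonyms.append(lemma)
            expanded.append(synonyms)
        return expanded

    print(expand_query("how about the weather"))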

From there, the chat log is searched for lines containing these synonyms — each word match improves the score of a particular line, which means it’s more likely to wind up as the best match to the query.
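In rough terms, the scoring might look like this (a simplification; the substring check also gives a crude form of partial word matching):

    def score_line(line, expanded_query):
        """Count how many query words (or any of their synonyms) appear in a log line."""
        text = line.lower()
        score = 0
        for synonyms in expanded_query:
            if any(syn.replace('_', ' ') in text for syn in synonyms):
                score += 1
        return score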

Other methods we attempted include turning the logs into bigrams, to give a bit more weight to pairs of words used in context. This proved too slow to run in real time; we would need to set up a cache or database of bigrams for each log to use this approach in the future. (It’s currently scrapped from the working implementation.)
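For reference, the bigram version amounted to something along these lines (a sketch only, since it’s no longer in the working code):

    from nltk import bigrams

    # e.g. [('how', 'about'), ('about', 'the'), ('the', 'weather')]
    query_bigrams = list(bigrams("how about the weather".split()))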

We also attempted to ignore line breaks in the logs and instead treat each stream of replies from one individual as a single chunk. This left us with unnaturally long-winded responses, slower searches (since the queries were much longer) and less of a quality improvement than we expected. (Also scrapped from the working implementation.)

Finally, our algorithm handles some basic housekeeping: a response gets flagged after it’s used, so that conversations won’t repeat themselves. Response scores are also normalized by length, so that longer lines (with more potential word matches) don’t dominate the conversation. It also manages the eternal conversational bounce between the two logs: after a response is generated, that response becomes the query to the other log… ad infinitum, until every single line is used.
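Put together, the bounce might be sketched like this (simplified, and relying on the hypothetical expand_query and score_line helpers above):

    def converse(log_a, log_b, seed, turns=50):
        """Alternate between two logs, flagging used lines so nothing repeats."""
        logs, used = [log_a, log_b], [set(), set()]
        query, side = seed, 0
        for _ in range(turns):
            expanded = expand_query(query)
            candidates = [i for i in range(len(logs[side])) if i not in used[side]]
            if not candidates:
                break
            # normalize scores by line length so longer lines don't dominate
            best = max(candidates,
                       key=lambda i: score_line(logs[side][i], expanded)
                                     / max(len(logs[side][i].split()), 1))
            used[side].add(best)
            response = logs[side][best]
            yield side, response
            # the response becomes the query against the other log
            query, side = response, 1 - side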

The source code is available on GitHub. The caprica3-presented.py file represents the most recent working implementation.

Result

Here’s an excerpt of a hypothetical conversation between my adolescent self and Michael:

Edgwired: what are we lying about?
obrigado: the royal you
Edgwired: we had to transfer them as files rather than as music
obrigado: hah
Edgwired: heh
obrigado: wtf?
Edgwired: music is portable
obrigado: J.R. Rodale
Edgwired: plus
obrigado: additionaly
Edgwired: cool to hang out
obrigado: all this time coco
Edgwired: this is what i’m leaning towards
obrigado: i have assumed
Edgwired: LOL
obrigado: haha
Edgwired: what monitor?
obrigado: right
Edgwired: that one is pretty good
obrigado: that the version of remind me
Edgwired: fuck it
obrigado: actually it is
Edgwired: serious

The full text is also available.

Even with our crude implementation, the generated conversations are at least moderately interesting. Humans are quite good at finding patterns and extrapolating meaning where there is actually very little of either, and I think this helps mask the mediocrity of the algorithm.

Future Improvements

We have a number of ideas for improvements that didn’t make it into the first cut.

We considered stemming the logs to increase the number of matches. However, the search code we’re using at the moment allows for partial word matches, so I’m not sure how much we would gain from this step.
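For reference, stemming each line with NLTK would be straightforward (a sketch; we haven’t measured whether it actually helps):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in "the weather conditions".split()]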

Another major issue is that the log data requires a massive amount of clean-up before it’s ready for use. Ideally, we would have a program that would automatically aggregate a user’s chat (or email, or Twitter, etc.) data without them needing to dig up their logs from the depths of the file system and run a bunch of finicky clean-up routines. Michael and I spent a huge amount of time dealing with character encoding issues and generally restructuring the log data so that it was consistent for both of us. Writing a reliable, hands-off parser would be a lot of work, but it would be consistent with the goals of the project: to provide access to an interactive, digital representation of oneself.

Python starts to show its slowness when you’re handling many thousands of lines of strings… for efficiency’s sake, the logs would benefit from migration to a database system.
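Even something as simple as SQLite would probably help (a hypothetical schema, not something we’ve built):

    import sqlite3

    conn = sqlite3.connect("caprica_logs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS lines (speaker TEXT, line TEXT)")
    # searching becomes a query instead of a scan over giant Python lists
    rows = conn.execute("SELECT line FROM lines WHERE line LIKE ?", ("%weather%",)).fetchall()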

And most importantly, the sentence similarity approach is deeply naïve. There’s a lot more to the reconstruction process than finding word matches, and to improve the results we will really need a way to extract and tag actual data from the logs. We will need some way to identify major themes and then weave them together into more convincing conversation.