Homework #1. Create a program that behaves like a UNIX text processing program (such as cat, grep, tr, etc.). Your program should take text as input (any text, or a particular text of your choosing) and output a version of the text that has been filtered and/or munged. Be creative, insightful, or intentionally banal.

Choose one text that you created with your program to read in class.

Bonus: Use the program that you created in tandem with another UNIX command line utility.

I try to avoid destroying data.

I draw upon the usual set of justifications: Storage is only getting cheaper, an empty HD occupies the same physical space as a full HD, yada yada.

Whether or not this policy is for the best, it’s left me with over a million lines of instant messenger conversation logs from my earlier years — mundane, personal conversations between myself and a handful of friends. (Running from about Jr. High to the end of High School.) If not of much general interest, the contents of these (often painfully angst-ridden) logs are a personally surreal thing to revisit.

In response to the first assignment, I wanted to draw from this well of text in some way. I’m particularly interested in the idea of accidental art — the daily, unintended collisions with things that might be formally construed as art.

I wrote a quick algorithm to put my variation of the Infinite monkey theorem to the test. Can enough angsty teens, given enough time to type, eventually produce something resembling a traditionally recognized “great work”?

I decided to pit my adolescent conversations against T.S. Eliot’s The Waste Land. I wasn’t interested in simply recreating the poem verbatim; instead, I used the first and last words of each line as anchors between the poem and my logs, and anything in the middle would be filled in from my conversations based on nearby words.

So, the algorithm takes the first and last word from each line of the poem, then looks for matches in my conversation logs. If it finds a match for both words, it walks forward from the first word and backward from the last word to create a line of text that starts and ends identically to a line in The Waste Land.

Finally, if the resulting line is too long, it will cut words out of the middle until the length of the output line matches the length of the line in The Waste Land. Currently, lines that are shorter than their equivalents in the original poem are just printed as-is. (It would be nice to find a reasonable way to beef these up to size.)
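
To make the walk-and-trim concrete, here’s a condensed sketch of just that step. It’s a hypothetical helper (not part of the script below), it assumes both anchor words are actually present in their lines, and it matches words exactly, whereas the full script at the end of this post allows partial matches:

def splice_and_trim(first_match, last_match, first_word, last_word, target_length):
    # Walk forward from the first anchor word...
    head = first_match.split(' ')
    head = head[head.index(first_word):]
    # ...and backward from the last anchor word.
    tail = last_match.split(' ')
    tail = tail[:tail.index(last_word) + 1]
    # Splice the halves together, then pop words from the middle
    # until the line is no longer than the model line.
    spliced = head + tail
    while len(spliced) > target_length:
        spliced.pop(len(spliced) // 2)
    return ' '.join(spliced)

So a six-word model line yields at most six words of reconstruction, anchored at both ends; anything already shorter passes through untrimmed.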

When matches aren’t found, the line is dropped. Only about 60% of the poem could be reconstructed from my million lines of conversation text. (For example, the word “abolie” never turned up in my logs, and was therefore not available to reconstruct that line of The Waste Land.)

The code is happy to work with any plain text files. Supply it with a model text (in this case, Eliot’s poem) and a source text (in this case, my conversation logs), and it will do its best to shape the source text into the model text.

The first argument to the program is the source text, and the second is the model text. For example, from the command line, this would use aim.txt as the source and wasteland.txt as the model, and save the results to a text file named aim-wasteland.txt:

python interstitial-wasteland.py aim.txt wasteland.txt > aim-wasteland.txt
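
In the spirit of the bonus, the output also chains cleanly into other command line utilities. For instance, to count how many lines (blank ones included) come out the other end:

python interstitial-wasteland.py aim.txt wasteland.txt | wc -l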

It takes a while to run, and you won’t get decent results unless the source text is huge.

Here’s the full output: interstitial-wasteland-output.txt

And the output with the original poem in parallel: interstitial-wasteland-output-parallel.txt

A small excerpt of the raw output:

I… wasn’t going us, he’s DEAD!

April one is designed for inbreeding,
Memory Lane and stirring
Winter Olympic Games. of recovering
Earth, yet we and feeding
And let me tell for like an hour…by
Individuals can be dangerous though
I"m confused http://winter.squaw.com/html/squawcambig.html

-What are the current problems problems?- it grows
Out man,

The same excerpt in parallel with the model text:

I. THE BURIAL OF THE DEAD
I… wasn’t going us, he’s DEAD!

April is the cruellest month, breeding
April one is designed for inbreeding,

Memory and desire, stirring
Memory Lane and stirring

Winter kept us warm, covering
Winter Olympic Games. of recovering

Earth in forgetful snow, feeding
Earth, yet we and feeding

And drank coffee, and talked for an hour.
And let me tell for like an hour…by

In the mountains, there you feel free.
Individuals can be dangerous though

I read, much of the night, and go south in the winter.
I"m confused http://winter.squaw.com/html/squawcambig.html

What are the roots that clutch, what branches grow
-What are the current problems problems?- it grows

Out of this stony rubbish? Son of man,
Out man,

And the source code:

import sys
args = sys.argv

# I hard coded these for my local testing.
#args = ['self', 'aim.txt', 'wasteland.txt']

# Set to true if you want extra output for debugging.
# TK turn this into a command line parameter.
verbose = 0

# Set to true if you want to show the original line above the munged one.
# TK turn this into a command line parameter.
print_original = 0

if verbose: print args
if verbose: print 'Take the text from ' + args[1] + ' and model it after ' + args[2]

# Pull the filenames from the command-line arguments.
source_file_name = args[1]
model_file_name = args[2]

# Open each file. (Error handling would be good here...)
source_file = open(source_file_name, 'r')
model_file = open(model_file_name, 'r')

# Read each line of each file into a list.
source_lines = source_file.readlines()
model_lines = model_file.readlines()


# Removes usernames from the start of a line, e.g. removes "OBRIGADO:"
def anonymize(line):
    if ':' in line:
        colon_index = line.index(':') + 1
        anonymous_line = line[colon_index:]
        return anonymous_line.strip()

    return line

# Clean up line breaks.
def remove_breaks(line):
    line = line.replace('\n', '')
    line = line.replace('\r', '')
    return line


# Gives the index of the first element containing the word.
# Less strict than .index(string) since it finds partial matches.
def word_at(string, words):
    index = 0
    for item in words:
        if string in item:
            return index
        index += 1

    return -1


# Go through the model and look for matches to the first and last words.
index = 0
for line in model_lines:
    # Make sure it's not a blank line.
    line = line.strip()

    # Put in line breaks if it is blank.
    if len(line) == 0:
        print ''

    # Otherwise, start processing.
    if len(line) > 1:
        # Place each word in a list.
        line_list = line.split(' ')
        first_word = line_list[0]
        last_word = line_list[-1]

        if verbose: print '------------------------------------'
        if verbose: print 'Line ' + str(index) + ' starts with "' + first_word + '" ends with "' + last_word + '"'

        # Find the first instance of the first word in the source file.
        for first_word_line in source_lines:
            if first_word in first_word_line:

                # We found the starting word, now find the ending word.
                for last_word_line in source_lines:
                    if last_word in last_word_line:

                        # We have both a starting and an ending word match!

                        # Clean up: remove line breaks and attribution.
                        # TK problem if match was in name?
                        first_word_line = anonymize(remove_breaks(first_word_line))
                        last_word_line = anonymize(remove_breaks(last_word_line))

                        # For the first line, save from the word forward.
                        first_line_list = first_word_line.split(' ')
                        first_word_index = word_at(first_word, first_line_list)
                        first_line_list = first_line_list[first_word_index:]

                        # For the last line, save from the word backward.
                        last_line_list = last_word_line.split(' ')
                        last_word_index = word_at(last_word, last_line_list)
                        last_line_list = last_line_list[0:last_word_index + 1]

                        # TK remove blank stuff.
                        complete_line_list = first_line_list + last_line_list
                        if verbose: print complete_line_list

                        # Construct a sentence as close to the original length as possible.
                        model_line_length = len(line_list)

                        # Remove words until we have the desired length.
                        # TK single word line problems?
                        while len(complete_line_list) > model_line_length:
                            # Pop from the middle.
                            complete_line_list.pop(int(len(complete_line_list) / 2))

                        complete_line = ' '.join(complete_line_list)

                        # Print the original above the munged line.
                        if print_original: print line

                        print complete_line

                        # Add a blank line for readability.
                        if print_original: print ''

                        break

                break

    index += 1