POS Shuffler | Frontier Nerds

Homework #2: The digital cut-up. Write a program that reads in and creatively re-arranges the content of one or more source texts. What is the unit of your cut-up technique? (the word, the line, the character? something else?) How does your procedure relate (if at all) to your choice of source text? Feel free to build on your assignment from last week.

I wanted to build a cut-up machine that was as grammatically and syntactically non-invasive as possible, while still significantly munging the source text.

So I decided to shuffle the text in a way that treated parts of speech as sacred — the words could fall where they may, but if the first word was a noun in the source text, it had better be replaced with a noun in the output. If the second word was a verb, nothing but a verb from elsewhere in the text should take its place, and so on.

Programmatically deriving a word’s part of speech turns out to be a major issue, so I leaned on NLTK to take care of this. It actually does a pretty decent job. From there it’s just a matter of storing lists of the words in a dictionary keyed to each part of speech, shuffling them, and then reconstituting the text.
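The bucket, shuffle, and rebuild idea can be sketched without NLTK. This is only an illustration: the `tagged` list is hand-written stand-in data with Penn Treebank-style labels, and `shuffle_by_pos` is a hypothetical helper name, not something from NLTK or from the script further down.

```python
import random
from collections import defaultdict

# Hand-tagged (word, tag) pairs standing in for a tagger's output.
# Both the sentence and the tags are made up for illustration.
tagged = [
    ("The", "DT"), ("dog", "NN"), ("chased", "VBD"),
    ("a", "DT"), ("cat", "NN"), ("and", "CC"),
    ("the", "DT"), ("bird", "NN"), ("watched", "VBD"), (".", "."),
]

def shuffle_by_pos(tagged_words, rng=random):
    """Return (word, tag) pairs where every slot keeps its original tag."""
    # Bucket the words by part of speech.
    pools = defaultdict(list)
    for word, tag in tagged_words:
        pools[tag].append(word)

    # Shuffle each bucket independently.
    for words in pools.values():
        rng.shuffle(words)

    # Walk the original tag sequence, drawing a word from the matching bucket.
    return [(pools[tag].pop(), tag) for _, tag in tagged_words]

shuffled = shuffle_by_pos(tagged, random.Random(0))
print(" ".join(word for word, _ in shuffled))
```

Because each bucket holds exactly as many words as there are slots with that tag, every `pop()` succeeds and every word from the source is used exactly once.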

I ran Obama’s most recent state of the union address through my algorithm. It’s formal enough and carefully constructed enough that it doesn’t pose a significant challenge to NLTK’s Brown Corpus-trained part-of-speech tagger. Also, I was struck by how much the output resembles the famous Bushisms, albeit with a lingering Obama-esque tinge.

Here’s some output:

Bloody Biden, succeed. Run Constitution, convictions from America, were moments, But great Allies: tested. union. declares during on time to look, the prosperity must time to union. history of the struggle at our For Sunday one leaders, our people have fulfilled all Bull Union chose done so of hesitations despite progress And America and we are turned again on the war that call and midst; in marchers as inevitable anything and fellow state. they’s tempting to move very of the guests and president at our victory distinguished civil, that President were back done to Tuesday and when the Beach was tested back about Our Omaha but these Americans great beaten that duty. Congress, nation crashed forward so of These Madame the future was of certain. tranquility. And first years periods was landed because doubt. Again, the depression was assume and Congress Speaker was rights on destined the market of our moments and the courage in our Black and at this our fears and divisions, our disagreements and our members, Vice prevailed that we have to answer always of one strife And 220 times.

They, It have When and much, we shall give information’s strength.

Sample output from the full text of Obama’s 2010 state of the union is also available here. The original text is also available for comparison.

The source code follows. It’s pretty simple, since NLTK does most of the heavy lifting.

import sys
import nltk
import random
import re

# Grab a file from standard input, dump it in a string.
source_text = sys.stdin.read()

# Use NLTK to make some guesses about each word's part of speech.
token_text = nltk.word_tokenize(source_text)
pos_text = nltk.pos_tag(token_text)

# Set up a dictionary where each key is a POS holding a list
# of each word of that type from the text.
pos_table = dict()

for tagged_word in pos_text:
  # Create the list, if it doesn't exist already.
  if tagged_word[1] not in pos_table:
    pos_table[tagged_word[1]] = list()

  pos_table[tagged_word[1]].append(tagged_word[0])

# Scramble the word lists.
for pos_key in pos_table:
  random.shuffle(pos_table[pos_key])

# Rebuild the text.
output = str()

for tagged_word in pos_text:
  # Take the last word from the scrambled list.
  word = pos_table[tagged_word[1]].pop()

  # Leave out the space if it's punctuation.
  if not re.match(r"[.,;:'!?]", word):
    output +=  " "

  # Accumulate the words
  output +=  word

# Remove white space.
output = output.strip()

print(output)
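The spacing rule in the rebuild loop can be looked at in isolation. The sketch below pulls it into a hypothetical `join_tokens` helper and feeds it a hand-made token list. It assumes the tokenizer splits contractions into pieces such as `'s`, which would explain fragments like "they’s" in the sample output: any token that starts with an apostrophe gets glued to whatever word landed before it.

```python
import re

# Tokens starting with one of these characters attach to the previous word.
PUNCT = re.compile(r"[.,;:'!?]")

def join_tokens(tokens):
    """Join tokens with spaces, except before punctuation-like tokens."""
    out = ""
    for token in tokens:
        if not PUNCT.match(token):
            out += " "
        out += token
    # Drop the space added before the first word.
    return out.strip()

print(join_tokens(["they", "'s", "tempting", ",", "again", "."]))
# prints: they's tempting, again.
```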