Harpo Jaeger dot com

The Text Reassembler

Tonight marks the launch of the newest entry to the projects section of this site, currently called the text reassembler (if you have a better idea for a name, please let me know). It’s something I started over a year ago that’s languished undeveloped for much of that time. I decided to put in a bit of time to get it viewable, and put it up as a work in progress.

The project is based off of something I saw demonstrated by Prof. Allen Downey at Olin College in the fall of 2008 while I was visiting. Although I didn’t end up applying there, I really liked the presentation he gave, on various computing methods and some of the ways of using computers to produce humanistic output. Particularly, he demonstrated a program that accepted inputted text, and broke down that text into an associative array with the following properties:

  • the array keys were the words in the text (so the text “hello how are you” would generate an array with keys “hello”, “how”, “are”, and “you”)
  • the value of each key was a non-associative array consisting of a list of every word that followed the key somewhere in the text (so the text “I saw you and he saw her” would be, in PHP format: array("I"=>array("saw"), "saw"=>array("you","her"), "you"=>array("and"), "and"=>array("he"), "he"=>array("saw"), "her"=>array()))
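
Here is a minimal sketch of how such an array could be built. It is written in Python rather than PHP, so the associative array becomes a dict and each value becomes a plain list. The function name build_followers and the choice to split words on whitespace are assumptions made for illustration; this is not the code from the demonstration or from this project.

```python
def build_followers(text):
    """Map each word to the list of words that follow it somewhere in the text."""
    words = text.split()  # assumption: words are separated by whitespace
    # Give every word a key up front, so the final word gets an empty list.
    followers = {word: [] for word in words}
    # Walk through adjacent pairs and record the second word under the first.
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return followers


if __name__ == "__main__":
    print(build_followers("I saw you and he saw her"))
```

For the example sentence, “saw” ends up with the list ["you", "her"] and “her” ends up with an empty list, since nothing follows it.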

This sounds complicated, but bear with me; the next step will make it make more sense. Once the program has generated this array, it begins to iterate through it in the following manner:

  1. it takes a key from the array and outputs it
  2. it selects a random item from the list of words following that key (the array stored in that key’s associated value) and outputs it
  3. then it jumps to the key entry for that word, and moves back to step 2

It repeats this until it encounters a word that has nothing following it (the last word in the source text, if it appears nowhere else) or until it has output a specified number of words (much more common). Thus, if you take any two adjacent words in the resulting text (which sounds uncannily similar in tone to the original and sounds like it should make sense, but is complete nonsense), you’ll be able to locate them, still adjacent, in the source text. It’s legitimately one of the most fascinating and beautiful things I’ve ever seen a computer do.
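
The loop and its two stopping conditions can be sketched roughly as follows, again in Python rather than PHP. The names reassemble, max_words, start and rng are hypothetical. Starting from the first word of the source text is also an assumption, since the description above does not say which key is taken first.

```python
import random


def build_followers(text):
    """Map each word to the list of words that follow it somewhere in the text."""
    words = text.split()
    followers = {word: [] for word in words}
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return followers


def reassemble(text, max_words, start=None, rng=random):
    """Generate up to max_words words by repeatedly hopping to a random follower."""
    words = text.split()
    if not words or max_words <= 0:
        return []
    followers = build_followers(text)
    # Assumption: begin with the first word of the text unless told otherwise.
    current = words[0] if start is None else start
    output = [current]
    # Stop at the word limit, or when the current word has nothing following it.
    while len(output) < max_words and followers.get(current):
        current = rng.choice(followers[current])
        output.append(current)
    return output


if __name__ == "__main__":
    source = "I saw you and he saw her"
    print(" ".join(reassemble(source, max_words=12)))
```

Because every hop goes from a word to one of its recorded followers, each adjacent pair in the output is also an adjacent pair somewhere in the source, which is the property described above.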

I came home determined to write a copy, and made a bit of progress. It lay around for a while, I did some more work on it at some point, but I never completed it. I rediscovered it this evening and just manned up and made it work well enough to be publicly viewable. When finished, it will accept input in a text field, or will be able to read input from a specified URL. At the moment, though, it’s just dealing with a block of text I hard-coded in (thus it’s not accepting any user input right now), and a word output limit imposed the same way. Both the source of the text and the length are displayed on the page.