The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, November 29, 2014

THE QUANTUM SUPPOSITION OF OZ

More and more Dorothy wondered how and why the great giants had ever submitted to become slaves of such skinny, languid masters …

One of the better turns of phrase from The Quantum Supposition of Oz

I'm done. I finished NaNoGenMo in only a few hours total of work. I decided against Racter vs. ELIZA because of the technical challenges. It's easy enough to find source code I can understand for some version of ELIZA, but the same can't be said for Racter. The code I do have is nearly incomprehensible, with no documentation other than the output of the program itself.

That in and of itself wouldn't be a show-stopper; I do have a running copy of Racter. But it's an MS-DOS executable that I have to run under an emulator, so piping the output from ELIZA to Racter and back again is not a trivial problem that can be solved in the few days left of NaNoGenMo. Pity, really, as the output would be most amusing to read.

So I fell back to the old stand-by—Markov chains. The input I used for the Markov chaining process (more on that below) was the entire works of Oz by L. Frank Baum. I can't say why I picked those, other than I had already downloaded them from Project Gutenberg some years ago and had them handy. And they are in the public domain, so anybody can butcher them.

Now a Markov chain is pretty straight-forward—I used an order-3 Markov chain. So you start with three words, say “the Wicked Witch.” That's your start, and you output that. Then you find each word that follows that phrase and count the number of times they occur:

Frequency of words following “the Wicked Witch”

word          count
of            22
.             10
,             9
was           7
and           6
had           5
has           2
really        1
discovered    1
conquered     1
a             1
who           1
merely        1
said          1
dies          1
put           1
or            1
before        1
died          1
enchanted     1
surrounded    1
ruled         1
is            1
took          1
looked        1
laughed       1
              1
realized      1
came          1

And from there, you can calculate the percentage chance of a given word following “the Wicked Witch”:

Percentage chance of a given word following “the Wicked Witch”

word          chance of following (%)
of            25.29
.             11.49
,             10.34
was           8.05
and           6.90
had           5.75
has           2.30
really        1.15
discovered    1.15
conquered     1.15
a             1.15
who           1.15
merely        1.15
said          1.15
dies          1.15
put           1.15
or            1.15
before        1.15
died          1.15
enchanted     1.15
surrounded    1.15
ruled         1.15
is            1.15
took          1.15
looked        1.15
laughed       1.15
              1.15
realized      1.15
came          1.15

You then pick a word randomly, but weighted by the percentage chance (“of” is more likely than “came”). Say the choice is “of.” That's the next word you output. Now your three words are “Wicked Witch of” and you repeat the process until you have printed the desired number of words.
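For the curious, the whole process fits in a few lines. What follows is a minimal sketch in Python, not the code that actually generated the novel; the names build_table and generate are made up for illustration. It assumes the text has already been split into a list of words. Note that the random pick can work directly from the raw counts, so the percentages never need to be computed.

```python
import random
from collections import Counter, defaultdict

ORDER = 3  # how many preceding words form the lookup key


def build_table(words, order=ORDER):
    """Map every run of `order` words to a Counter of the words seen after it."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key][words[i + order]] += 1
    return table


def generate(table, start, length, rng=random):
    """Output `start`, then repeatedly pick a follower weighted by its count."""
    out = list(start)
    key = tuple(start)
    while len(out) < length:
        followers = table.get(key)
        if not followers:  # dead end: this key was never followed by anything
            break
        candidates = list(followers.keys())
        counts = list(followers.values())
        word = rng.choices(candidates, weights=counts)[0]
        out.append(word)
        key = key[1:] + (word,)  # slide the three-word window forward
    return out
```

Calling build_table on the word list and then generate(table, ("the", "Wicked", "Witch"), 50000) would produce up to 50,000 words starting from that phrase (the length here is an arbitrary example). Passing a seeded random.Random instance as rng makes a run repeatable.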

In my case, the initial words were three paragraph markers (¶) and the initial opening paragraphs that came out were:

THE WONDERFUL WIZARD OF OZ

CHAP . 17

The Shaggy Man laughed merrily .

" A prisoner is a captive , " replied Ozma , promptly .

" Just wonderful ! " declared the Lion , in a voice of horror .

" Oh , indeed ! " exclaimed the Pumpkinhead .

" I'd kick out with those long legs and large knees and feet . Below in the streets of the conquered city and the gardens and Rinkitink thought the best part of me then remaining . Moreover , there was little pleasure in talking with the goat they kept away from the others .

They now entered the great hall , his shaggy hat in his hands , was a big house , round , red cap held in place by means of its strings , securely around the Ork's neck , just where his belt was buckled . He rose into the air , for I can stand it if the others can . "

So Dorothy , who had gone on ahead , came bounding back to say that Dorothy and the Scarecrow and Ozma alone ; but Dorothy had been listening with interest to this conversation . Now she turned to her captives and said :

" Are you certain this is snow ? " she asked .

The Quantum Supposition of Oz

Yes, the spacing of the punctuation is a bit odd, and I'll get to that in a bit.

And the fact that I start with chapter 17 is a quirk of the Markov chaining process, as is the initial line of the novel, “THE WONDERFUL WIZARD OF OZ,” due to the initial three words selected (three paragraph markers).

Now, most of the time on this project was spent in two phases:

1. An initial editing of the Oz books from Project Gutenberg. I had to remove all the verbiage that didn't directly relate to the story. This included not only the text Project Gutenberg added, but Table of Contents and Introductions in each book, as well as page numbers and references to illustrations.

This was perhaps an hour or two of time—only one book had page numbers (thankfully, the other thirteen did not) and the text editor made light work of removing the image references. Most of the verbiage removed was located at the start and end of each book, so that was easy to cut.

2. Defining what a “word” was for the Markov chaining.

Seriously.

I spent more time on this than I did on the initial editing.

So, what is a word?

A quick answer is “letters surrounded by space.”

And that's good for about 95% of the words. But then you get stuff like “I'll” or “Dorothy's”. Then you expand the definition to “letters, with an embedded apostrophe, surrounded by space.” Then you come across “goin'” and you redefine yet again. Then you come across “Tik-tok” (a character in the story) or “Coo-ee-oh” and you redefine your definition yet again. Then you find “how-d” and “ye-do” and realize you need to handle “how-d'ye-do” and by now you realize you also missed “Dr.” and “Mr.” and “P. S.” and …

Yes, the definition of a “word” isn't quite so simple (oh, and then you come across entries like “No. 17”—sigh).

In the end, I defined a word as such (and in this order):

  1. A series of blank lines denotes a paragraph marker—¶.
  2. Punctuation (these first two rules avoid the dreaded “wall-of-text” you often get in generative text, but paragraph markers and punctuation are printed as words, hence the odd spacing you see)
  3. “--” (these designate an m-dash, a typographical punctuation mark)
  4. Digits (but see below)
  5. “Mr.”
  6. “MR.”
  7. “Mrs.”
  8. “MRS.”
  9. “Dr.”
  10. “DR.”
  11. “P. S.” (and the variation “P.S.”)
  12. “T. E.” (and the variation “T.E.”—stands for “Thoroughly Educated”)
  13. “Gen.” (short for “General”)
  14. “No. ” followed by digits (no real reason for that—I just did it that way)
  15. “N. B.” (and the variation “N.B.”)
  16. “H.” (an initial)
  17. “M.” (an initial)
  18. “O.” (an initial)
  19. “Z.” (an initial)
  20. A few really complicated rules to catch “how-d'ye-do” but avoid making a word out of “me--please” (some context: “don't strike me–please don't”).
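Rules like these can be approximated with a single ordered regular expression. The following is a rough sketch in Python, not the actual rules used for the novel: the pattern, the alternation order, and the name tokenize are all assumptions made for illustration. It assumes plain ASCII text and treats every other non-space character as a one-character punctuation “word.”

```python
import re

PARA = "\u00b6"  # the paragraph marker

# Alternatives are tried left to right at each position, so the special
# cases must come before the general word and punctuation patterns.
TOKEN = re.compile(r"""
    \n\s*\n                           # a series of blank lines
  | --                                # two hyphens standing in for a dash
  | No\.\ \d+                         # "No. " followed by digits
  | (?:Mrs|MRS|Mr|MR|Dr|DR|Gen)\.     # titles
  | P\.\ ?S\. | T\.\ ?E\. | N\.\ ?B\. # two-letter abbreviations, with or without the space
  | [HMOZ]\.                          # single initials
  | \d+                               # digits
  | [A-Za-z]+(?:['-][A-Za-z]+)*'?     # letters with embedded apostrophes or single hyphens
  | [^\sA-Za-z0-9]                    # any other single character counts as punctuation
""", re.VERBOSE)


def tokenize(text):
    """Split text into words, turning each run of blank lines into a paragraph marker."""
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group(0)
        tokens.append(PARA if token[0] == "\n" else token)
    return tokens
```

Because a hyphen only joins a word when a letter follows it, “how-d'ye-do” stays in one piece while “me--please” splits into “me”, “--” and “please”. One known weakness of this sketch: the optional trailing apostrophe that catches “goin'” will also swallow a closing single quote that directly follows a word.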

Then all that was left was to generate a few novels (about a minute or two) and pick one that at least starts off strong and there you have it, a novel.

Oh, and the code that generated this awful dreck, should you be interested.

Obligatory Picture

[Here I am, enjoying my vacation in a rain forest.]

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2017 by Sean Conner. All Rights Reserved.