Saturday, November 29, 2014
THE QUANTUM SUPPOSITION OF OZ
More and more Dorothy wondered how and why the great giants had ever submitted to become slaves of such skinny, languid masters …
One of the better turns of phrase from The Quantum Supposition of Oz
I'm done. I finished NaNoGenMo in only a few hours total of work. I decided against Racter vs. ELIZA because of the technical challenges. It's easy enough to find source code I can understand for some version of ELIZA; the same can't be said for Racter. The code I do have is nearly incomprehensible, with no documentation other than the output of the program itself.
That in and of itself wouldn't be a show-stopper, since I do have a running copy of Racter. But it's an MS-DOS executable that I have to run under an emulator, so piping the output from ELIZA to Racter and back again is not a trivial problem, and not one that can be solved in the few days left of NaNoGenMo. Pity, really, as the output would be most amusing to read.
So I fell back to the old stand-by—Markov chains. The input I used for the Markov chaining process (more on that below) was the entire works of Oz by L. Frank Baum. I can't say why I picked those, other than I had already downloaded them from Project Gutenberg some years ago and had them handy. And they are in the public domain, so anybody can butcher them.
Now a Markov chain is pretty straightforward. I used an order-3 Markov chain, so you start with three words, say “the Wicked Witch.” That's your start, and you output that. Then you find every word that follows that phrase in the source text and count the number of times each one occurs:
word | count |
---|---|
of | 22 |
. | 10 |
, | 9 |
was | 7 |
and | 6 |
had | 5 |
has | 2 |
really | 1 |
discovered | 1 |
conquered | 1 |
a | 1 |
who | 1 |
merely | 1 |
said | 1 |
dies | 1 |
put | 1 |
or | 1 |
before | 1 |
died | 1 |
enchanted | 1 |
surrounded | 1 |
ruled | 1 |
is | 1 |
took | 1 |
looked | 1 |
laughed | 1 |
¶ | 1 |
realized | 1 |
came | 1 |
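A minimal sketch of that counting step in Python. This is not the code used for the novel; the function name and the toy sentence are made up for illustration, and the text is assumed to be already split into "words":

```python
from collections import Counter, defaultdict

def follower_counts(words, order=3):
    """Map each run of `order` words to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table

# Toy token list (not the real Oz text).
words = ("the Wicked Witch of the East . the Wicked Witch was dead . "
         "the Wicked Witch of the West").split()
table = follower_counts(words)

for word, count in table[("the", "Wicked", "Witch")].most_common():
    print(word, count)
# of 2
# was 1
```

Keying the table on a tuple of words keeps the lookup for the next step a single dictionary access.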
And from there, you can calculate the percentage chance of a given word following “the Wicked Witch”:
word | chance of following (%) |
---|---|
of | 25.29 |
. | 11.49 |
, | 10.34 |
was | 8.05 |
and | 6.90 |
had | 5.75 |
has | 2.30 |
really | 1.15 |
discovered | 1.15 |
conquered | 1.15 |
a | 1.15 |
who | 1.15 |
merely | 1.15 |
said | 1.15 |
dies | 1.15 |
put | 1.15 |
or | 1.15 |
before | 1.15 |
died | 1.15 |
enchanted | 1.15 |
surrounded | 1.15 |
ruled | 1.15 |
is | 1.15 |
took | 1.15 |
looked | 1.15 |
laughed | 1.15 |
¶ | 1.15 |
realized | 1.15 |
came | 1.15 |
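The conversion is each count divided by the total number of followers, times 100. A small sketch of that arithmetic, using a hypothetical helper and made-up counts rather than the real table:

```python
def follower_percentages(counts):
    """Turn {word: count} into {word: percentage of all followers}."""
    total = sum(counts.values())
    return {word: round(100 * n / total, 2) for word, n in counts.items()}

# Made-up counts for illustration only.
print(follower_percentages({"of": 3, "was": 1}))
# {'of': 75.0, 'was': 25.0}
```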
You then pick a word randomly, but weighted by the percentage chance (“of” is more likely than “came”). Say the choice is “of.” That's the next word you output. Now your three words are “Wicked Witch of” and you repeat the process again and again until you've printed the desired number of words.
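The pick-and-slide loop can be sketched in Python as follows. This is an illustration under assumptions, not the program that produced the novel: `build_table` and `generate` are hypothetical names, the input is a toy sentence, and the raw counts are used directly as weights (which gives the same odds as the percentages).

```python
import random
from collections import Counter, defaultdict

def build_table(words, order=3):
    """Count which word follows each run of `order` words."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        table[tuple(words[i:i + order])][words[i + order]] += 1
    return table

def generate(table, seed, length, rng=random):
    """Output the seed, then keep choosing a weighted-random follower."""
    out = list(seed)
    context = tuple(seed)
    while len(out) < length:
        followers = table.get(context)
        if not followers:
            break  # dead end: this context only occurs at the very end of the text
        choices, weights = zip(*followers.items())
        word = rng.choices(choices, weights=weights, k=1)[0]
        out.append(word)
        context = context[1:] + (word,)  # drop the oldest word, add the new one
    return out

# Toy token list (not the real Oz text).
words = ("the Wicked Witch of the East . the Wicked Witch was dead . "
         "the Wicked Witch of the West").split()
table = build_table(words)
print(" ".join(generate(table, ("the", "Wicked", "Witch"), 12)))
```

With a corpus this small the output mostly replays the input; the interesting recombinations only show up once many contexts have several possible followers.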
In my case, the initial words were three paragraph markers (¶) and the initial opening paragraphs that came out were:
THE WONDERFUL WIZARD OF OZ
CHAP . 17
The Shaggy Man laughed merrily .
" A prisoner is a captive , " replied Ozma , promptly .
" Just wonderful ! " declared the Lion , in a voice of horror .
" Oh , indeed ! " exclaimed the Pumpkinhead .
" I'd kick out with those long legs and large knees and feet . Below in the streets of the conquered city and the gardens and Rinkitink thought the best part of me then remaining . Moreover , there was little pleasure in talking with the goat they kept away from the others .
They now entered the great hall , his shaggy hat in his hands , was a big house , round , red cap held in place by means of its strings , securely around the Ork's neck , just where his belt was buckled . He rose into the air , for I can stand it if the others can . "
So Dorothy , who had gone on ahead , came bounding back to say that Dorothy and the Scarecrow and Ozma alone ; but Dorothy had been listening with interest to this conversation . Now she turned to her captives and said :
" Are you certain this is snow ? " she asked .
Yes, the spacing of the punctuation is a bit odd, and I'll get to that in a bit.
And the fact that I start with chapter 17 is a quirk of the Markov chaining process, as is the initial line of the novel, “THE WONDERFUL WIZARD OF OZ,” due to the initial three words selected (three paragraph markers).
Now, most of the time on this project was spent in two phases:
1. An initial editing of the Oz books from Project Gutenberg. I had to remove all the verbiage that didn't directly relate to the story. This included not only the text Project Gutenberg added, but Table of Contents and Introductions in each book, as well as page numbers and references to illustrations.
This was perhaps an hour or two of time—only one book had page numbers (thankfully, the other thirteen did not) and the text editor made light work of removing the image references. Most of the verbiage removed was located at the start and end of each book, so that was easy to cut.
2. Defining what a “word” was for the Markov chaining.
Seriously.
I spent more time on this than I did on the initial editing.
So, what is a word?
A quick answer is “letters surrounded by space.”
And that's good for about 95% of the words. But then you get stuff like “I'll” or “Dorothy's”. Then you expand the definition to “letters, with an embedded apostrophe, surrounded by space.” Then you come across “goin'” and you redefine yet again. Then you come across “Tik-tok” (a character in the story) or “Coo-ee-oh” and you redefine your definition yet again. Then you find “how-d” and “ye-do” and realize you need to handle “how-d'ye-do” and by now you realize you also missed “Dr.” and “Mr.” and “P. S.” and …
Yes, the definition of a “word” isn't quite so simple (oh, and then you come across entries like “No. 17”—sigh).
In the end, I defined a word as such (and in this order):
- A series of blank lines denotes a paragraph marker—¶.
- Punctuation (this and the paragraph marker are there to avoid the dreaded “wall-of-text” you often get in generative text, but they're printed as words, hence the odd spacing you see)
- “--” (this designates an em-dash, a typographical punctuation mark)
- Digits (but see below)
- “Mr.”
- “MR.”
- “Mrs.”
- “MRS.”
- “Dr.”
- “DR.”
- “P. S.” (and the variation “P.S.”)
- “T. E.” (and the variation “T.E.”—stands for “Thoroughly Educated”)
- “Gen.” (short for “General”)
- “No. ” followed by digits (no real reason for that—I just did it that way)
- “N. B.” (and the variation “N.B.”)
- “H.” (an initial)
- “M.” (an initial)
- “O.” (an initial)
- “Z.” (an initial)
- A few really complicated rules to catch “how-d'ye-do” but avoid making a word out of “me--please” (some context: “don't strike me--please don't”).
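To give a feel for how ordered rules like these fit together, here is a rough Python sketch using one regular expression whose alternatives are tried in order. It is an assumption-laden approximation, not the rules actually used: the `TOKEN` pattern and `tokenize` function are invented for illustration, it assumes plain ASCII quotes and apostrophes, and it skips the initials and the “P. S.” / “T. E.” / “N. B.” cases entirely.

```python
import re

# Alternatives are tried top to bottom, so the special cases come first.
TOKEN = re.compile(r"""
    \n\s*\n                          # blank line(s): paragraph break
  | --                               # typewriter-style em-dash
  | (?:Mrs|MRS|Mr|MR|Dr|DR|Gen)\.    # titles keep their period
  | No\.\s\d+                        # a number reference stays one token
  | [A-Za-z]+(?:['-][A-Za-z]+)*'?    # words with inner apostrophes or hyphens
  | \d+                              # digits
  | [^\sA-Za-z\d]                    # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Split text into 'words', turning paragraph breaks into a ¶ marker."""
    tokens = []
    for match in TOKEN.finditer(text):
        tok = match.group()
        tokens.append("¶" if tok.isspace() else tok)
    return tokens

sample = "\"How-d'ye-do, Mr. Tik-tok?\"\n\nDon't strike me--please don't."
print(tokenize(sample))
```

Because a hyphen only joins a word when a letter follows it, “me--please” comes out as three tokens while “how-d'ye-do” stays whole. One known weakness of this sketch: the optional trailing apostrophe that catches “goin'” would also swallow a closing single quote.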
Then all that was left was to generate a few novels (about a minute or two) and pick one that at least starts off strong and there you have it, a novel.
Oh, and the code that generated this awful dreck, should you be interested.