Thursday, August 31, 2000
Over fifteen million pages right here …
The publicly indexable web contains an estimated 800 million pages as of February 1999, encompassing about 15 terabytes of information or about 6 terabytes of text after removing HTML tags, comments, and extra whitespace.
Accessibility and Distribution of Information on the Web [Steve Lawrence, Lee Giles, NEC Research Institute]
I've been thinking recently about the definition of a webpage (only because the work I've done may redefine what people consider a webpage. Maybe. We'll see). A quick scan of Conman Laboratories revealed 234 files that constitute what is commonly called a webpage. 234 pages is something like 0.00003% of the indexed web (as of February 1999). Not a significant portion.
But that's only the part you see under www.conman.org. It took a while to calculate, but bible.conman.org has 15,620,753 pages. Yup. A lowly 486SX-33 is serving up over fifteen million pages, which works out to be almost 2% of the indexed web.
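For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The page counts and the 800 million estimate come straight from the text above; the function and constant names are made up for illustration.

```python
# Figures are taken from the post; the names are hypothetical.
INDEXED_WEB_PAGES = 800_000_000   # Lawrence and Giles estimate, February 1999
WWW_PAGES = 234                   # files under www.conman.org
BIBLE_PAGES = 15_620_753          # pages under bible.conman.org


def share_of_web(pages: int, total: int = INDEXED_WEB_PAGES) -> float:
    """Return `pages` as a percentage of `total`."""
    return 100.0 * pages / total


if __name__ == "__main__":
    # Five decimal places are needed before the small share shows up at all.
    print(f"www.conman.org:   {share_of_web(WWW_PAGES):.5f}%")
    print(f"bible.conman.org: {share_of_web(BIBLE_PAGES):.2f}%")
```

The second figure comes out just under 2%, which is where "almost 2%" comes from.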
That is, if it were indexed.
But still, fifteen million pages isn't anything to sneeze at. Even more amazing is that these fifteen million pages only consume something like 5M of disk space. Uncompressed. Not bad for a bunch of two-bit pages, eh? (That's a joke. A rather bad joke based upon simple math but anyway … )
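The "simple math" behind the joke can be sketched like this. It assumes "5M" means 5 × 2^20 bytes, which is a guess, since the figure is only approximate; the names are hypothetical.

```python
# Assumption: "5M" is read as 5 * 2**20 bytes. The post only says
# "something like 5M", so the result is a rough figure.
DISK_BYTES = 5 * 2**20
BIBLE_PAGES = 15_620_753  # page count from the post


def bits_per_page(disk_bytes: int, pages: int) -> float:
    """Average number of bits of storage behind each page."""
    return disk_bytes * 8 / pages


if __name__ == "__main__":
    print(f"{bits_per_page(DISK_BYTES, BIBLE_PAGES):.2f} bits per page")
```

Reading "5M" as 5,000,000 bytes instead gives a slightly smaller number, but either way it lands between two and three bits per page.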
Basically, those 15,620,753 pages are nothing more than 15,620,753 partial ways of viewing one single work, the King James Bible. There isn't anything else comparable to it on the web.
Sure, there are online bibles where you can pull out a verse, chapter or book, but none that I know of allow you to arbitrarily select which portions to read, which starts to stretch the definition of what a webpage actually is.
And for the record, one of the “pages” is a file telling the various search engine indexers not to index these pages.
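For the curious, a file like that usually follows the robots exclusion convention. The sketch below is an illustration only: the rules shown are a hypothetical "keep everything out" file, not the actual contents of the one on bible.conman.org, and the helper name is made up. It uses Python's standard urllib.robotparser module to ask whether a crawler honoring those rules would be allowed to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that ask every crawler to stay out entirely.
# The real file on the site may well say something different.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_index(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` is allowed to fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(may_index(HYPOTHETICAL_ROBOTS_TXT, "http://bible.conman.org/"))
```

With these rules a well-behaved indexer would skip all fifteen million pages; an empty rules file would let it in.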