The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Thursday, January 09, 2003

More things to make you go “Hmmmmmm …”

From: Ken Maier <XXXXXXXXXXXXXXXXXXX>
To: Sean Conner <sean@conman.org>
Subject: Something to suck your time …
Date: Tue, 7 Jan 2003 23:35:44 -0500

Here's more on the “Things that make you go ‘Hmmmmm …’” topic you posted on your site today. This site goes into much more detail … very interesting … warning, big time sink ahead:

http://www.unansweredquestions.net/

Forty years later and we still don't know all the details about November 22nd, so I'm wondering just how long it'll be before all the details about September 11th come out.

I don't suppose anyone alive today will ever know for sure …

Friday, January 10, 2003

Notes on surviving a Slashdot Effect

If you read “meta” sites like Slashdot, Kuro5hin, Fark, Met4filter (natch), and Memepool, you've probably encountered links to stories you can't reach—namely because the act of linking has brought down a server not prepared for massive traffic, or worse, put the hapless soul over their bandwidth cap, denying any use to anyone for the rest of the month, or day, or whatever time period the ISP or hosting provider uses to allocate bandwidth.

The ethics of linking

Mark and I have often gone back and forth about what we would need to do to survive a slashdotting if we ever got linked. Most of the solutions we've come up with so far center on distributing the affected site(s) to other servers and round-robining (is that a term?) between them, or some other form of load balancing. So far, that hasn't been a problem (and thankfully so—we both have fears of being slashdotted and finding the slagged remains of the 33 MHz 486 that is currently our server).

But one of the suggestions in “The ethics of linkage” is to redirect all requests back to Google, since Google probably can't be slashdotted at all. Using mod_rewrite, you can probably do something along the lines of:

RewriteEngine on
RewriteBase   /
# untested!  Use at own risk!
# be sure to change domain after "cache:" as needed
RewriteCond   %{HTTP_REFERER} ^http://.*slashdot\.org.*
RewriteRule   ^(.*)$ http://www.google.com/search?q=cache:boston.conman.org/$1 [R,L]

But this will only help if the URLs being slashdotted exist in the Google cache; otherwise it does no good. For instance, this entry, the very one you are reading now, has yet (as of January 10th, 2003) to be read and cached by Google, and it probably won't be cached for some time. So I can only hope that if this article gets slashdotted, it's after Google has googled it.

Which means that it is still a good idea to think of other ways of surviving a slashdotting, but for an ad-hoc method, this is probably a decent solution until we get something better into place.

Saturday, January 11, 2003

“Avast ye swabbies! Copyright and Trademark violations abound!”

From: Bob Apthorpe <XXXXXXXXXXXXXXXXXXXXX>
To: Sean Conner <sean@conman.org>
Subject: More run-ins with Nameprotect.com
Date: Sat, 11 Jan 2003 04:52:18 -0600

Early on, I found spiders from Cyveillance.com rummaging through a bunch of my dynamically-generated web pages (message boards, mailing list admin pages, etc.) Of course, there was no reverse DNS on the spider and it was claiming to be some version of Internet Explorer, but hitting pages once a second and crawling every day on a dynamically-generated calendar is a tip-off you're not dealing with a meth-addled web surfer. Rainman, perhaps, but definitely not a real human.

I don't have anything to hide but that's no justification for letting ill-mannered commercial robots rummage through the electronic equivalent of my sock drawer. I close the door when I'm in the bathroom. I wear pants. Modesty and privacy do not imply improper behavior. Besides, I have a few hundred megabytes of photos of improv comedy shows I've played in. I don't want my connection saturated because some anonymous robot was brainlessly and greedily slurping content that no human was ever going to enjoy, at least not in the way I intended. My network, my rules.

Email from Bob Apthorpe

Now I know bloggers' readership figures are inflated. I checked and sure enough, Cyveillance came ripping through my site last month for 213 hits (which I hadn't noticed—I think I'm now down to 75 or so real human hits per day). Now, unlike NameProtect®'s rather terse use of Mozilla/4.7 as a user-agent, Cyveillance has gone to the other extreme:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0);Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 5.0);Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 4.0);Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)

I guess they're running their robot under Windows 2000 (reported as Windows NT 5.0), Windows NT 4.0 and Windows NT 3.51, and want to cover all the bases.

Brand Protection Solution

Cyveillance's Brand Protection Solution helps companies actively protect their brand equity by returning control over online brand integrity and use. By identifying and providing detailed intelligence on sites leveraging a company's brand for their own commercial purposes, Cyveillance enables companies to transform the Internet from a branding liability to a high-impact branding medium.

With Cyveillance's Brand Protection Solution, clients are able to accomplish the following:

Client Success Story

Many clients have leveraged Cyveillance's Brand Protection Solution to prevent revenue leakage and recoup lost dollars. For example, a large insurance agency leveraged Cyveillance's Brand Protection Solution because the client wanted to stop traffic diversion from its corporate Web site by other sites leveraging this client's name, logo and slogan to drive business. Cyveillance identified several hundred cases of sites diverting potential buyers away from this client's site. These cases included several in which the client's own agents were using the brand to drive traffic from the corporate Web site and others in which sites were using the recognizable name and logo in meta tags, URLs and titles.

With this knowledge, the client could immediately take action against the misrepresented sites, prevent further revenue leakage and strengthen brand equity.

Cyveillance Brand Management

Beautiful the way they phrase things, isn't it?

I would think that judicious use of Google would be just as effective, and possibly cheaper, than hiring an outfit like NameProtect® or Cyveillance, but that's just me.

It would be nice if these sites would follow the Robots Exclusion Protocol, but nooooooooooooo!
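For contrast, honoring the protocol costs a robot almost nothing; a site signals “keep out” with a couple of lines in /robots.txt. A generic sketch follows (the robot name and path are made up, and of course these particular robots don't identify themselves honestly enough to be singled out):

# /robots.txt -- a well-behaved robot fetches this file first
# and skips anything disallowed for its User-agent

User-agent: *
Disallow: /photos/       # hypothetical path to keep all robots out of

User-agent: BrandBot     # hypothetical name; these robots claim to be MSIE
Disallow: /              # bar this one robot from the whole site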

My only consolation is that they find their way towards xxx.lanl.gov, because, you know, the name says it all, and besides, they just <SARCASM>loooooove robots</SARCASM> coming through their site.

Sunday, January 12, 2003

When shell scripts are faster than C programs

My stats program for The Boston Diaries basically consists of a shell script that calls a custom program (written in C) to print only certain fields from the website logfile, which is then fed into a pipeline of some twenty invocations of grep. It basically looks like:

cat logfile | 				     \
escanlog -status 200 -host -command -agent | \
        grep Mozilla |                       \
        grep -v 'Slurp/cat' |                \
        grep -v 'ZyBorg' |                   \
        grep -v 'bdstyle.css' |              \
        grep -v 'screen.css' |               \
        grep -v '^12.148.209.196'|           \
        grep -v '^4.64.202.64' |             \
        grep -v '^213.60.99.73' |            \
        grep -v 'Ask Jeeves' |               \
        grep -v 'rfp@gmx.net' |              \
        grep -v '"Mozilla"' |                \
        grep -v 'Mozilla/4.5' |              \
        grep -v '.gif ' |                    \
        grep -v '.png ' |                    \
        grep -v '.jpg ' |                    \
        grep -v 'bostondiaries.rss' |        \
        grep -v 'bd.rss' |                   \
        grep -v 'favicon.ico' |              \
        grep -v 'robots.txt' |               \
        grep -v $HOMEIP

It's serviceable, but it does filter out Lynx and possibly Opera users, since I first select for Mozilla and then reject what I don't want. Twenty greps—that's pretty harsh, especially on my server. And given that more and more robots seem to be hiding themselves, the list of exclusions will only get longer and longer.

I think that at this point, a custom program would be much better.

So I wrote one. In C. Why not Perl? Well, I don't know Perl, and I already have all the code I really need in C; there's even a regex library installed on the system that I can call. That, mixed with the code I already have to parse a website logfile and an extensive library of C code for handling higher-level data structures, meant it wouldn't take me all that long to write the program I wanted.

First, start out with a list of rules:

# configuration file for filtering web log files

reject	host	'^XXXXXXXXXXXXX$'	# home ip

filter  status  '^2[0-9][0-9]$'

reject  command '^HEAD.*'
reject  command '.*bostondiaries\.rss.*'
reject  command '.*/robots\.txt.*'
reject  command '.*/bd\.rss.*'
reject  command '.*/CSS/screen\.css.*'
reject  command '.*/bdstyle\.css.*'
reject  command '.*\.gif .*'
reject  command '.*\.png .*'
reject  command '.*\.jpg .*'
reject  command '.*favicon\.ico.*'

accept  agent   '.*Lynx.*'

filter  agent   '.*Mozilla.*'

reject  agent   '.*Slurp/cat.*'
reject  agent   '.*ZyBorg.*'
reject  agent   '.*la2@unspecified\.mail.*'
reject  agent   '.*Ask Jeeves.*'
reject  agent   '.*Gulper Web Bot.*'
reject  agent   '^Mozilla$'
reject  agent   '^Mozilla/4.7$'
reject  agent   '^Mozilla/4.5 \[en\] (Win98; I)$'
reject  agent   '.*rfp@gmx\.net.*'
reject  host    '^4\.64\.202\.64$'
reject  host    '^63\.148\.99.*'

The second column is which field of the logfile to check, while the last column is the regular expression to match against. The first column is the rule to apply: filter will continue processing only if the field matches the regular expression, otherwise the request is discarded; reject will automatically discard the request if the given field matches, and accept will automatically accept it (both of these short-circuit the evaluation). Once all the rules have finished processing (or an accept fires), the resulting request is printed.

So, for example, the first line there rejects any request coming from my home network (as requests from there would skew the results), and the second line discards any request that didn't result in a valid page. And we go on from there.
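To make the control flow concrete, here is a sketch of the evaluation loop (illustrative only; the actual program's types, field handling and error checking differ):

#include <stddef.h>
#include <regex.h>

typedef enum { ACCEPT , REJECT , FILTER } RuleType;

typedef struct
{
  RuleType type;    /* first column: accept, reject or filter     */
  int      field;   /* second column: which logfile field to test */
  regex_t  re;      /* third column, compiled with regcomp()      */
} Rule;

/* return 1 if the request should be printed, 0 if discarded */
int process_rules(Rule *rules,size_t nrules,char *fields[])
{
  size_t i;

  for (i = 0 ; i < nrules ; i++)
  {
    int match = (regexec(&rules[i].re,fields[rules[i].field],0,NULL,0) == 0);

    switch(rules[i].type)
    {
      case ACCEPT: if (match)  return(1); break; /* short circuit: keep    */
      case REJECT: if (match)  return(0); break; /* short circuit: discard */
      case FILTER: if (!match) return(0); break; /* must match to continue */
    }
  }
  return(1);   /* survived every rule */
}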

The programming itself goes rather quickly. That is good.

The program itself goes rather slowly. That is not good.

It is, in fact, slower than the shell script with twenty invocations to grep. It is, in fact, a C program that is beaten by a shell script.

That is really not good.

Time to profile the code. Now, it might be that the regex library I was calling was slow, but I discounted that—it should be well tested, right?

Right?
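For reference, the numbers below come from the stock gprof routine: rebuild with profiling enabled, run the program over a logfile, then read the flat profile. A sketch of the incantation (the file names here are made up for illustration):

gcc -pg -g -o logfilter logfilter.c   # hypothetical source/binary names
./logfilter rules.conf logfile        # a normal run drops a gmon.out
gprof logfilter gmon.out              # prints flat profiles like those below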

Results of profiling (each sample counts as 0.01 seconds):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
65.69      0.90     0.90    11468     0.08     0.08  read_line
 8.03      1.01     0.11    11467     0.01     0.02  process_rules
 5.84      1.09     0.08   265271     0.00     0.00  NodeValid
 5.84      1.17     0.08    11467     0.01     0.01  read_entry
 2.92      1.21     0.04    11495     0.00     0.00  mem_alloc

So right away we can see that read_line() is sucking up a lot of time here. Let's see what it could be:

char *read_line(FILE *fpin)
{
  int     c;
  char   *buffer = NULL;
  char   *np;
  size_t  size   = 0;
  size_t  ns     = 0;

  while(!feof(fpin))
  {
    if (size + 1 >= ns)    /* grow, leaving room for the trailing '\0' */
    {
      ns = size + LINESIZE;
      np = realloc(buffer,ns);
      if (np == NULL) return(buffer);
      buffer = np;
      buffer[size] = '\0';
    }

    c = fgetc(fpin);
    if ((c == '\n') || (c == EOF)) return(buffer);
    buffer[size]   = c;
    buffer[++size] = '\0';
  }
  return(buffer);
}

Not much going on. About the only thing that might be sucking up time is the allocation of memory (since LINESIZE was set to 256). I ran a quick test and found that the longest line in the logfiles is under 1,000 characters, while the average size was about 170. Given the number of lines in some of the files (40,000 lines, which, as logfiles go, isn't that big), and given that the code makes a duplicate of each line (since I break one copy up into individual fields), I'm calling malloc() upwards of 80,000 times—more if the lines are exceptionally long!

Fast forward over rewriting and profiling. Since malloc() can be an expensive operation, skip it entirely: pass the memory for read_line() to use as parameters, and since we're going to make a copy of the line anyway, why not pass in two blocks of memory and fill both at once?

/* p1 and p2 must each point to at least size+1 bytes; the same
** line is copied into both buffers in a single pass.           */

void read_line(FILE *fpin,char *p1,char *p2,size_t size)
{
  int c;

  while((size-- > 0) && ((c = fgetc(fpin)) != EOF))
  {
    if (c == '\n') break;
    *p1++ = *p2++ = c;
  }
  *p1 = *p2 = '\0';
}
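The point of the two-buffer interface is that the caller now allocates once, outside the loop, instead of once or more per line. A sketch of what the calling side might look like (the 1,024-byte bound follows from the measurement above; this is illustrative, not the actual process() function):

#define LINEMAX 1024    /* longest observed line was under 1,000 chars */

void process(FILE *fpin)
{
  char raw[LINEMAX + 1];   /* the line as it appears in the logfile */
  char dup[LINEMAX + 1];   /* scratch copy to carve into fields     */

  while(!feof(fpin))
  {
    read_line(fpin,raw,dup,LINEMAX);
    /* split dup[] into fields; raw[] is left intact */
  }
}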

Now, run it under the profiler:

Results of profiling (each sample counts as 0.01 seconds):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
80.93      2.97     2.97        1  2970.00  3670.00  process
 8.99      3.30     0.33    11468     0.03     0.03  read_line
 5.18      3.49     0.19    11467     0.02     0.02  read_entry
 4.90      3.67     0.18    11467     0.02     0.02  process_rules
 0.00      3.67     0.00       31     0.00     0.00  empty_string

Much better. Now the actual processing is taking most of the time, so time to test it again on the server.

Still slower than the shell script.

Grrrrrrr

I'm beginning to suspect that it's the call to regexec() (the regular expression engine) that is slowing the program down, but there is still more I can do to speed the program up.

I can remove the call to fgetc(). Under Unix, I can map the file into memory using mmap(), and then it just becomes searching through memory looking for each line, instead of explicitly calling the I/O routines. Okay, so I code up my own File object, which mmap()'s the file into memory, and modify read_line() appropriately.
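The shape of it is something like this (a sketch only; the real File object has more error handling, and the unmap/close path isn't shown):

#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

typedef struct
{
  char   *data;   /* the entire file, mapped read-only */
  size_t  size;   /* size of the mapping               */
  size_t  off;    /* current read position             */
} File;

File *file_open(const char *name)
{
  struct stat  st;
  File        *fp;
  int          fh;

  if ((fh = open(name,O_RDONLY)) < 0)      return(NULL);
  if (fstat(fh,&st) < 0)                   { close(fh); return(NULL); }
  if ((fp = malloc(sizeof(File))) == NULL) { close(fh); return(NULL); }

  fp->data = mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,fh,0);
  fp->size = st.st_size;
  fp->off  = 0;
  close(fh);    /* the mapping survives the close() */
  if (fp->data == MAP_FAILED) { free(fp); return(NULL); }
  return(fp);
}

int file_eof(File *fp)
{
  return(fp->off >= fp->size);
}

/* same contract as before, but scanning memory instead of calling fgetc() */
void read_line(File *fp,char *p1,char *p2,size_t size)
{
  while((size-- > 0) && (fp->off < fp->size))
  {
    int c = fp->data[fp->off++];
    if (c == '\n') break;
    *p1++ = *p2++ = c;
  }
  *p1 = *p2 = '\0';
}

Compile and profile: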

Results of profiling (each sample counts as 0.01 seconds):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
42.86      0.06     0.06    11467     5.23     5.23  process_rules
35.71      0.11     0.05    11467     4.36     4.36  read_entry
 7.14      0.12     0.01    11468     0.87     0.87  file_eof
 7.14      0.13     0.01    11467     0.87     0.87  read_line
 7.14      0.14     0.01        1 10000.00 140000.00  process

Finally! process_rules() now shows up as sucking up the majority of the runtime. Test it on the server and … it's still slow.

Okay, so now I know that regexec() is making a C program slower than a shell script with twenty invocations of grep. Just to satisfy my own curiosity, I crank up the optimizations the compiler uses (using gcc -O4 -fomit-frame-pointer ...) and create two versions—one with the call to regexec() stubbed out and one with the call still in. I then run both on the server, timing the execution of each.
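Stubbing out the call can be as simple as a compile-time switch (one way to do it, not necessarily how it was done here):

/* must come after #include <regex.h> so the macro overrides the call */
#ifdef STUB_REGEXEC
#  define regexec(preg,str,nmatch,pmatch,eflags)  (0)  /* everything “matches” */
#endif

Build one binary with -DSTUB_REGEXEC and one without, then run both under time(1). The stubbed binary's output is meaningless, of course; only the timing matters.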

Timings of two programs—one calling regexec() and one not:

                Stubbed version   regexec() version
User time             6.16            2842.71
System time           0.21              41.02
Elapsed time       00:06.38           48:41.89

Not quite seven seconds for the stubbed version, and almost an hour for the one calling regexec(). And this is on this month's logfile, which isn't even complete yet (approximately 11,400 lines). I'm beginning to wonder just what options RedHat used to compile the regex library.

I then searched the net for a later version of the library. It seems there really is only one, which nearly everybody doing regular expressions in C uses: written by Henry Spencer and last updated in 1997 (which means it's very stable code, or no one is using C anymore—both of which may be true). I sucked down a copy, compiled it, and ran the regression tests it came with. It passed, so I recompiled with heavy optimization (gcc -O4 -fomit-frame-pointer ...) and used that in my program:

Results of profiling (each sample counts as 0.01 seconds):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
60.68      1.96     1.96   424573     0.00     0.00  sstep
21.67      2.66     0.70   137497     0.01     0.02  smatcher
10.84      3.01     0.35    20124     0.02     0.11  sfast
 2.79      3.10     0.09   137497     0.00     0.02  regexec
 2.17      3.17     0.07    11467     0.01     0.28  process_rules

Okay, now we're getting somewhere. Try it on the server and it runs in less than a minute (which is tolerable, given the machine).

Finally!

I guess the main point of this is: when you profile code, make sure that any libraries you use are being profiled as well! I could have saved quite a bit of time had I known for a fact that it was the regex library slowing me down (I could, in fact, have just done the stub test first to see whether it was indeed my code, but alas … ). But even so, it was a fun exercise.

Tuesday, January 14, 2003

Russian Ark

Hence there was but a single shooting day with four hours of existing light. Thousands of people in front of and behind the camera simply had to work together perfectly. The Hermitage was closed and restored to its original condition allowing cinematographer Tilman Büttner to travel through the Museum through an equivalent of 33 studios, each of which had to be lit in one go to allow for 360-degree camera movements. All of this was accomplished within a vulnerable environment that holds some of the greatest art treasures of all time, from Da Vinci to Rembrandt. After months of rehearsals, 867 actors, hundreds of extras, three live orchestras and 22 assistant directors had to know their precise positions and lines.

Via kottke.org, Russian Ark: Production Notes

A movie filmed in one 90-minute take.

Wow!

While this wasn't the first film planned as a “real time” film (Hitchcock did one, Rope, but if you watch it you'll notice that about every 10 to 12 minutes the camera tracks to a weird location, like the back of someone's black coat, and then continues on; it was during these “tracking shots” that the action was stopped and the camera reloaded, so that, if need be, any ten-minute segment of the film could conceivably be reshot without destroying the “continuity” as it were), it's the first to be shot continuously in real time; no film reloading here.

Continuous tracking shots are hard to do. Robert Altman's The Player opens with what was, at the time, the longest and most technically complicated tracking shot in a film, pushing the limits of a single film canister (and if you look closely at the shot, you'll see Buck Henry pitching a sequel to The Graduate). Russian Ark, however, is orders of magnitude beyond that.

Eight hundred and sixty-seven actors!

My mind is boggling at the thought.


Information wants to be available …

The reasons are clear enough: in an attention economy, the key is to capture customers and keep them focused. The dojinshi market does exactly that. Fans obsess; obsessions work to the benefit of the original artist. Thus, were the law to ban dojinshi, lawyers may sleep better, but the market for comics generally would be hurt. Manga publishers in Japan recognize this. They understand how “theft” can benefit the “victim,” even if lawyers are trained to make the thought inconceivable.

Via dive into mark, What lawyers can learn from comic books

I've linked to a few other articles where making intellectual property (books, music, comics) easily available helps sales in the long run, even if it may facilitate an apparent “pirate market” in the short run. This article by Lawrence Lessig expresses that point even more strongly (and while he is correct in his reference to the potential legal action by Sony against someone who hacked the Aibo, I can see Sony's side of the picture—they were trying to limit their liability in case someone saw the information, hacked their Aibo, broke it, then tried to return it to Sony, despite language in the warranty stating that modifications to the Aibo void it; Sony has since changed their mind).


Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, which would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion (for example, https://boston.conman.org/2000/08). You can even select an arbitrary portion of time.

You may also notice subtle shading of the links; that's intentional: the “closer” the link is (relative to the page), the “brighter” it appears. It's an experiment in using color shading to denote how far away a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.