The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Wednesday, May 05, 2010

Millions of moving parts

In a system of a million parts, if each part malfunctions only one time out of a million, a breakdown is certain.

—Stanislaw Lem

In between paying work, I'm getting syslogintr ready for release—cleaning up the Lua scripts, adding licensing information, making sure everything I have actually works, that type of thing. I have quite a few scripts that isolate individual aspects of working scripts—for instance, checking for ssh attempts and blocking the offending IP—but they weren't fully tested. A few were tested (as I'm using them at home), but not all.

I updated the code on my private server and rewrote its script to use the new modules (as I'm calling them), only to watch the server seize up tight. After a few hours of debugging, I fixed the issue.

Only it wasn't with my code.

But first, the scenario I'm working with. Every hour, syslogintr will check to see if the webserver and nameserver are still running (why here? Because I can, that's why) and log some stats gathered from those processes. The checks are fairly easy—for the webserver I query mod_status and log the results; for the nameserver, I pull the PID from /var/run/named.pid and from that, check to see if the process exists. If they're both running, everything is fine. It was when both were not running that syslogintr froze.
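The nameserver half of that check is simple enough to sketch in a dozen lines. Here's roughly the idea in Python (the PID-file path is BIND's usual default, and the function name is my own, not syslogintr's):

```python
import os

def named_is_running(pidfile="/var/run/named.pid"):
    """Is the process named in PIDFILE still alive?

    /var/run/named.pid is BIND's usual default; adjust to taste.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False              # no PID file, or garbage in it
    try:
        os.kill(pid, 0)           # signal 0: existence check, no signal sent
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # process exists, owned by someone else
    return True
```

Signal 0 is the standard trick here: the kernel does all the permission and existence checks but delivers nothing.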

Now, when the appropriate check determines that a process isn't running, it not only logs the situation but also sends me an email alert. If only one of the two processes was down, syslogintr would work fine. It was only when both were down that it froze up solid.

I thought it was another type of syslog deadlock: Postfix spews forth multiple log entries for each email going through the system, and it could be that too much data is logged before syslogintr can read it; thus Postfix blocks, causing syslogintr to block, and thus, deadlock.

Sure, I could maybe increase the socket buffer size, but that only pushes the problem out a bit; it doesn't fix the issue once and for all. Any real fix would probably have to deal with threads: one to read data continuously from the sockets and queue it up, and another to pull the queued results and process them. That would require a major restructuring of the whole program (and I can't stand the pthreads API). Faced with that, I decided to see what Stevens has to say about socket buffers:

With UDP, however, when a datagram arrives that will not fit in the socket receive buffer, that datagram is discarded. Recall that UDP has no flow control: It is easy for a fast sender to overwhelm a slower receiver, causing datagrams to be discarded by the receiver's UDP

Hmm … okay, according to this, I shouldn't get deadlocks, because nothing should block. And when I checked the socket receive buffer size, it was way larger than I expected it to be (around 99K, if you can believe it), so even if a process could block while sending a UDP packet, Postfix (and certainly syslogintr) wasn't sending that much data.
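Checking the receive buffer size is a one-liner with getsockopt(). A quick Python equivalent of that check (on Linux, values set via setsockopt() are doubled to cover bookkeeping overhead, per socket(7), which can make the reported figure surprisingly large):

```python
import socket

# One-off check of the kernel's default receive-buffer size on a
# datagram socket; a figure like the ~99K above comes from a call
# like this.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
print(rcvbuf)
```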

And on my side, there wasn't much code to check (around 2,300 lines of code for everything). So when a process list showed that sendmail was hanging, I decided to start looking there.

Now, I use Postfix, but Postfix comes with a “sendmail” executable that's compatible (command line wise) with the venerable sendmail. Imagine my surprise then:

[spc]brevard:~>ls -l /usr/sbin/sendmail
lrwxrwxrwx  1 root root 21 Feb  2  2007 /usr/sbin/sendmail -> /etc/alternatives/mta
[spc]brevard:~>ls -l /etc/alternatives/mta
lrwxrwxrwx  1 root root 26 May  5 16:30 /etc/alternatives/mta -> /usr/sbin/sendmail.sendmail

Um … what the … ?

[spc]brevard:~>ls -l /usr/sbin/sendmail*
lrwxrwxrwx  1 root root      21 Feb  2  2007 /usr/sbin/sendmail -> /etc/alternatives/mta
-rwxr-xr-x  1 root root  157424 Aug 12  2006 /usr/sbin/sendmail.postfix
-rwxr-sr-x  1 root smmsp 733912 Jun 14  2006 /usr/sbin/sendmail.sendmail

Oh.

I was using sendmail's sendmail instead of Postfix's sendmail all this time.

Yikes!

When I used Postfix's sendmail everything worked perfectly.

Sigh.
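For next time, walking a symlink chain hop by hop makes this sort of /etc/alternatives indirection obvious up front. A small Python sketch (nothing sendmail-specific about it; the helper name is mine):

```python
import os

def resolve_chain(path):
    """Walk a symlink chain one hop at a time, returning every step.

    Handy for spotting /etc/alternatives-style indirection before
    it bites.
    """
    chain = [path]
    seen = set()
    while os.path.islink(path) and path not in seen:
        seen.add(path)            # guard against symlink loops
        target = os.readlink(path)
        # A relative target is relative to the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain
```

Unlike os.path.realpath(), this keeps every intermediate link, which is the part that matters when the middle of the chain points somewhere unexpected.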


mod_lua patched

And speaking of bugs, the bug I submitted to Apache was fixed!

Woot!

Monday, May 10, 2010

An update on the updated Greylist Daemon

Internet access at Chez Boca was non-existent today (scuttlebutt: a fibre cut, the third one in about two months' time), and without the Intarweb pipes, I can't work (good news! I get a day off! Bad news! I can't surf the web!), so I figured I would take the time to get a few personal projects out the door.

First one up: a new version of the greylist daemon has been released. A few bugs are fixed. The first was an error condition that wasn't properly sent back to the gld-mcp. The second was a segmentation fault (not fatal, actually; the greylist daemon restarts itself on a segfault) if it received too many requests (by “too many” I mean “the tuple storage is filled to capacity”; when that happens, it just dumps everything it has and starts over, but I forgot to take that into account elsewhere in the code). The last prevented the Postfix interface from logging any syslog messages (I think I misunderstood how setlogmask() worked).
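On the setlogmask() point: the call takes a mask of priorities to let through, not a severity threshold, which is the easy thing to misread (and my guess at the bug above). Python wraps the same C call, so the semantics can be checked directly; LOG_UPTO() builds the “this severity and anything worse” mask:

```python
import syslog

# setlogmask() wants a bitmask of allowed priorities, built from
# LOG_MASK() (one priority) or LOG_UPTO() (a priority and everything
# more severe).  Passing a bare priority constant silently masks
# almost everything off.
old = syslog.setlogmask(syslog.LOG_UPTO(syslog.LOG_WARNING))

mask = syslog.LOG_UPTO(syslog.LOG_WARNING)
err_passes = bool(syslog.LOG_MASK(syslog.LOG_ERR) & mask)      # True
debug_passes = bool(syslog.LOG_MASK(syslog.LOG_DEBUG) & mask)  # False

syslog.setlogmask(old)             # restore the previous mask
```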

The other changes are small (a more pedantic value for the number of seconds per year, adding sequence numbers to all logged messages (that's another post), and setting the version number directly from source control), but the reason I'm pushing out version 1.0.12 now (well, aside from there being nothing else to do yesterday) is related to the one outstanding bug (that I know of) in the program. That bug deals with bulk data transfers (Mark has been bitten by it; I haven't), and I suspect I know the problem, but the solution to that problem requires an incompatible change to the program.

Well, okay—the easy solution requires an incompatible change to the program. The problem is the protocol used between the greylist daemon and the master control program. The easy solution is to just change the protocol and be done with it; the harder solution would be to work the change into the existing protocol, and that could be done, but I'm not sure if it's worth the work. The number of people (that I know of) using the greylist daemon I can count on the fingers of one hand, so an incompatible change wouldn't affect that many people.

Then again, I might not hear the end of it from Mark.

In any case, there are more metrics I want to track (more on that below), and those would require a change to the protocol as well (or, more technically, an extension to the protocol). The additional metrics may help with some long-term plans that involve making the greylist daemon multithreaded (yes, the main page states I can get over 6,000 requests a second; that's a very conservative value: drop the amount of logging and I can handle over 32,000/second, all on a single core).

Some of the new metrics involve tracking the lowest and highest number of tuples during any given period of time. This should help with fine-tuning the --max-tuples parameter (which currently defaults to 65,536). I've noticed over the years that I don't seem to get much past 3,000 tuples at any one time, but I would like to make sure before I tweak the value on my mail server.

The other metrics I want to track are the number of tuple searches (or “reads”) and the number of tuple updates (additions or deletions; in other words, “writes”). These metrics should help if I decide to go multithreaded with the greylist daemon, by showing how best to store the tuples depending on whether the application is read-heavy or write-heavy (my guess is that reads and writes are nearly equal, which presents a whole set of challenges of its own). Not that the program is slow as it is, but I would be interested to see if I can improve on those numbers with additional cores.
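As a sketch of what such bookkeeping might look like (the names and structure here are mine, not the greylist daemon's, and it keeps a lock since the point is a multithreaded future):

```python
import threading

class TupleMetrics:
    """Low/high watermarks on the tuple count per reporting period,
    plus read and write counters."""

    def __init__(self, count=0):
        self.lock = threading.Lock()
        self.count = self.low = self.high = count
        self.reads = self.writes = 0

    def read(self):
        with self.lock:
            self.reads += 1

    def write(self, delta):           # +1 for an add, -1 for a delete
        with self.lock:
            self.writes += 1
            self.count += delta
            self.low = min(self.low, self.count)
            self.high = max(self.high, self.count)

    def report(self):
        """Return (low, high, reads, writes) and reset for the next period."""
        with self.lock:
            stats = (self.low, self.high, self.reads, self.writes)
            self.low = self.high = self.count
            self.reads = self.writes = 0
            return stats
```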

Since there are quite a few changes that require protocol changes, I decided to make this the last release of the 1.x line and start work on version 2 of the greylist daemon (or maybe 1.5, leaving 2.x for the multithreaded version).


A new project released!

I've made mention of my syslogd replacement, but since I had nothing else to do today, I buckled down, wrote (some) documentation and have now officially released syslogintr under a modified GNU GPL license.

It currently supports only Unix and UDP sockets (both unicast and multicast) under IPv4 and IPv6; it's compatible with RFC 3164 (which documents current best practice and isn't a “standard” per se) and is fully scriptable with Lua. Because of that, it's probably easier to configure than rsyslogd and syslog-ng, both of which bend over backwards in trying to support, via their ever increasingly complex configuration files, all sorts of weird logging scena—Oh! Intarwebs back up! Shiny!
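For reference, the RFC 3164 PRI prefix that any syslog receiver has to deal with encodes facility and severity as facility × 8 + severity. A minimal parser (my own sketch, not syslogintr's code, which does this in C and Lua):

```python
def parse_pri(msg):
    """Split an RFC 3164 message into (facility, severity, rest).

    The PRI prefix encodes both as facility * 8 + severity; the
    RFC's own example "<34>..." is facility 4 (auth), severity 2
    (critical).
    """
    if not msg.startswith("<"):
        return None                    # no PRI part present
    end = msg.index(">")
    pri = int(msg[1:end])
    return pri // 8, pri % 8, msg[end + 1:]
```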

Wednesday, May 12, 2010

Just gotta love stupid benchmarks

Because I like running stupid benchmarks, I thought it might be fun to see just how fast syslogintr is—to see if it even gets close to handling thousands of requests per second.

So, benchmarks.

All tests were run on my development system at Chez Boca: 1 GB of RAM, 2.6 GHz dual-core Pentium D (at least, that's what my system is reporting). I tested syslogintr linked against Lua and against LuaJIT. All runs relayed messages (after processing) to a multicast address (one set without a corresponding listener, and another set with one).

The script being tested was using.lua from the current version; the executable was compiled without any optimizations.
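The load-generating half of a stupid benchmark like this can be as simple as blasting datagrams at the daemon. A rough Python sketch (the port number and payload are arbitrary choices of mine, not what I actually used):

```python
import socket
import time

def blast(count=10_000, addr=("127.0.0.1", 5140)):
    """Fire COUNT syslog-style UDP datagrams at ADDR and return the
    send rate in messages/second.  UDP doesn't care whether anything
    is listening, which is what makes this sort of test so easy."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"<13>May 12 12:00:00 bench test: hello"
    start = time.monotonic()
    for _ in range(count):
        s.sendto(payload, addr)
    elapsed = time.monotonic() - start
    s.close()
    return count / elapsed
```

Measuring the receiving side is the part that takes care; the sender's rate only tells you how fast you can fill the kernel's buffers.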

And the results:

Number of messages per second processed by syslogintr

                        Lua     LuaJIT
no multicast listener   10,250  12,000
multicast listener       8,400   8,800

Not terribly shabby, given that the main logic is in a dynamic scripting language. It would probably be faster still if it skipped the relaying entirely and were compiled with heavy optimizations, but that's a test for another day.

Update a few minutes later …

I forgot to mention—those figures are for a non-threaded (that is, it only runs on a single CPU) program. Going multithreaded should improve those figures quite a bit.

Saturday, May 15, 2010

Yet another new (actually, old) project released!

Two years ago I wrote a program to wrap Perl in order to catch script kiddie Perl scripts on the server.

Today, I'm deleting a huge backlog of information recorded by said program. Normally under Linux, the physical directory (the list of files in said directory) is 4,096 bytes; the directory I'm trying to delete is 110,104,576 bytes in size, which by a rough calculation means there are over 3,000,000 files in that directory.
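The rough calculation, assuming an ext3 directory entry for a short filename runs around 32 bytes (my assumption; the real size varies with name length):

```python
# Sanity check on the 3,000,000-files figure.
dir_bytes = 110_104_576        # size of the bloated directory
approx_entry_bytes = 32        # assumed per-entry cost (short names)
approx_files = dir_bytes // approx_entry_bytes
print(approx_files)            # on the order of 3.4 million
```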

And this isn't the first time I've deleted that directory.

But what I didn't realize is that I never got around to releasing the program.

Woooooooooooot! Another project released!

Update on Monday, April 18th, 2022

I've since unpublished the code. I'm not sure I even have the code anymore.

Saturday, May 29, 2010

Death by a thousand SQL queries

The Company just got hired to take over the maintenance and development of a mid-sized Web 2.0 social website that's the next Big Thing™ on the Internet. For the past week we've been given access to the source code, set up a new development server and have been basically poking around both the code and the site.

The major problem with the site was performance—loads exceeding 50 were common on both the webserver and database server. The site apparently went live in January and has since grown quickly, straining the existing infrastructure. That's where we come in, to help with “Project: SocialSpace2.0” (running on the ubiquitous LAMP stack).

The site is written in PHP (of course), and one of the cardinal rules of addressing performance issues is “profile, profile, profile”—the bottleneck is almost never where you think it is. Now, I've profiled code before, but that was C, not PHP. I'm not even sure where one would begin to profile PHP code. And even if we had access to a PHP profiler, profiling the program on the development server may not be good enough (the development server has maybe half the data of the production server, so it may not hit the pathological cases the production server encounters).

So what to do as the load increases on the webserver?

Well, this answer to profiling C++ code gave me an idea. In one window I ran top. In another window, a command line. When a particular instance of Apache hit the CPU hard, as seen in top, I quickly got a listing of open files in said process (listing the contents of /proc/pid/fd to find the offending PHP file causing the load spike).

Laugh if you will, but it worked. About half a dozen checks led to one particular script causing the issue—basically a “people who viewed this profile also viewed these profiles” script.
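Scripted, the /proc trick looks something like this in Python (Linux-specific; the helper is my own, not anything from the site's code):

```python
import os

def open_files(pid):
    """List the files a process has open: the scripted version of
    eyeballing `ls -l /proc/PID/fd`."""
    fd_dir = "/proc/%d/fd" % pid
    files = []
    for fd in os.listdir(fd_dir):
        try:
            files.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            pass                # fd closed between listdir and readlink
    return files
```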

I checked the code in question and found the following bit of code (in pseudocode, to basically protect me):

for viewers in SELECT userID
	  	FROM people_who_viewed
	  	WHERE profileID = {userid} 
		ORDER BY RAND()
  for viewees in SELECT profileID
	  	FROM people_who_viewed
	  	WHERE userID = {viewers['userID']}
		ORDER BY RAND()
    ...
  end
end

Lovely!

An O(n²) algorithm—in SQL no less!

No wonder the site was dying.

Worse, the site only displayed about 10 results anyway!

A simple patch:

for viewers in SELECT userID
	  	FROM people_who_viewed
	  	WHERE profileID = {userid} 
		ORDER BY RAND() LIMIT 10
  for viewees in SELECT profileID
	  	FROM people_who_viewed
	  	WHERE userID = {viewers['userID']}
		ORDER BY RAND() LIMIT 10
    ...
  end
end

And what do you know? The site is actually usable now.
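The back-of-the-envelope numbers show why the LIMIT matters so much, using illustrative (not actual) figures:

```python
# Cost of the nested loop for one popular profile, assuming n viewers,
# each of whom viewed m profiles (made-up round numbers):
n = m = 1_000
unpatched_rows = n * m          # the inner query runs once per viewer
patched_rows = 10 * 10          # both loops capped by LIMIT 10
print(unpatched_rows, patched_rows)
```

A single JOIN with a LIMIT would cut the query count further still, but even the two-line patch turns a million-row traversal into a hundred.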


Alas, poor Clusty! I knew it, Horatio: a search engine of infinite results …

YIPPY is foremost the world's first fully-functioning virtual computer. A cloud-based worldwide LAN, YIPPY has turned every computer into a terminal for itself. On the surface, YIPPY is one-stop shopping for the web surfing needs of the average consumer. YIPPY is an all-inclusive media giant; incorporating television, gaming, news, movies, social networking, streaming radio, office applications, shopping, and much more—all on the fastest internet browser available today.

Yippy » About

I wish I could say the above was a joke (but you should read the rest of the above page purely for its entertainment value—the buzzword quotient on that page is pure comedy gold), but alas, it is not. Some company called Yippy has bought Clusty and turned what used to be my favorite (if occasionally used) search engine into some LSD-induced happy land of conservative values:

Yippy.com, its sub-domains and other web based products (such as but not limited to the Yippy Browser) may censor search results, web domains and IP addresses. That is, Yippy may remove from its output, in an ad-hoc manner, all but not limited to the following:

  1. Politically-oriented propaganda or agendas
  2. Pornographic Material
  3. Gambling content
  4. Sexual products or sites that sell same
  5. Anti-Semitic views or opinions
  6. Anti-Christian views or opinions
  7. Anti-Conservative views or opinions
  8. Anti-Sovereign USA views or opinions
  9. Sites deemed inappropriate for children

Yippy » Censorship

I cannot, in good conscience (even if I may agree with some of the above), endorse such censorship from a search engine, nor, if I refuse to use it, force others to use it. Even my own site has gambling content on it, so I too could be censored (or at least my site could be removed from search results). Not to mention it encourages people to report “questionable content:”

Yippy users are our greatest defense against objectionable material. Should a keyword or website by found that returns this kind of material it may be reported in the CONTACT US tab located on the landing page of the Yippy search engine. Our staff will quickly evaluate all responses and reply back within 24 hours for a resolution notice. We thank you in advance for helping keep Yippy the greatest family friendly destination online today.

Yippy » Terms of Service

Thus, I'm going back to Google, and have removed Clusty (sigh) from my site.

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, which makes the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.
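That scheme is easy to mechanize; for instance, in Python:

```python
def entry_url(year, month, day=None):
    """Build a permalink following the scheme described above; omit
    DAY to link to the whole month."""
    base = "https://boston.conman.org/%04d/%02d" % (year, month)
    return base if day is None else "%s/%02d" % (base, day)
```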

You may also note subtle shading of the links; that's intentional: the “closer” a link is (relative to the page), the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.