The Boston Diaries

Tuesday, April 01, 2008

Hey! What happened to the Ides of March?

Oh man, is it April First already?

XXXX!

I never did get around to mucking with the stylesheets this year (unlike the previous four years).

Sigh.

Maybe next year.

Compression fever

It's one thing to trick someone on April Fools' Day, but to trick yourself really takes talent.

I have that talent.

Sigh.

I started this last week and didn't suspect a thing.

I came across The Hutter Prize, which will pay up to 50,000€ to compress a 100,000,000 byte file to less than 16,481,655 bytes (which includes the decompression program). The rules are straightforward, and to me, it certainly seems easier than The Netflix Prize.

At the very least, it couldn't hurt to play around a bit.

So last Thursday I downloaded the file to be compressed and played around with it. The file itself is nothing more than Wikipedia entries wrapped in XML. The XML only makes up 2.4% of the file—the rest is text (13.5% is nothing but the space character; 8% is the letter “e”).

Friday I started coding. Nothing difficult and it was pretty straightforward (although while the algorithm for Huffman encoding is pretty simple, writing it was surprisingly difficult; it took about five attempts before I was happy with the code). I did a few benchmarks against gzip and bzip2 and was surprised with the results:

My Sooperseekrit compression algorithm vs. `gzip` and `bzip2`
Program	Size of archive
original file	100,000,000
`gzip`	36,445,248
`bzip2`	29,008,736
my attempt	19,774,963

While not less than the required 16,481,655 bytes, it is in the ballpark, and I was quite surprised to beat the snot out of bzip2.

Not bad for a few hours of work.
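For reference, the textbook Huffman construction can be sketched roughly as below. This is not the program described in this entry (its source isn't shown here); `huffman_lengths` and `MAX_SYMS` are names made up for illustration, and the sketch only computes code lengths with a simple quadratic search rather than a priority queue.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SYMS 256   /* illustrative limit: one symbol per byte value */

/* Compute Huffman code lengths for n symbols from their frequencies by
 * repeatedly merging the two lightest subtrees (simple O(n^2) version).
 * freq[i] is the count for symbol i; len[i] receives its code length. */
static void huffman_lengths(const uint64_t *freq, int n, unsigned *len)
{
  uint64_t weight[2 * MAX_SYMS];
  int      parent[2 * MAX_SYMS];   /* -1 means "still a root" */
  int      nodes = n;
  int      i;

  assert(n >= 2 && n <= MAX_SYMS);

  for (i = 0; i < n; i++)
  {
    weight[i] = freq[i];
    parent[i] = -1;
  }

  while (1)
  {
    int a = -1;   /* lightest root */
    int b = -1;   /* second lightest root */

    for (i = 0; i < nodes; i++)
    {
      if (parent[i] != -1)
        continue;
      if (a < 0 || weight[i] < weight[a])
      {
        b = a;
        a = i;
      }
      else if (b < 0 || weight[i] < weight[b])
        b = i;
    }

    if (b < 0)    /* only one root left: the tree is complete */
      break;

    weight[nodes] = weight[a] + weight[b];
    parent[nodes] = -1;
    parent[a]     = nodes;
    parent[b]     = nodes;
    nodes++;
  }

  /* A symbol's code length is its depth in the finished tree. */
  for (i = 0; i < n; i++)
  {
    int j;

    len[i] = 0;
    for (j = i; parent[j] != -1; j = parent[j])
      len[i]++;
  }
}
```

With counts of 5, 1, 1 and 2 for four symbols, the lengths come out as 1, 3, 3 and 2 bits: the most frequent symbol gets the shortest code.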

So today, I'm hacking around with my program when I notice something odd—some of the Huffman encodings are the same.

That's not a good sign—each encoding should be unique. It takes a while (it takes almost 7½ minutes to compress 100,000,000 bytes with my program) but I find the problem—a bug in the encoding routine. Building the Huffman encoding table was fine—it was reading the encoded values that was buggy (and dropping bits).
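A sanity check along these lines would flag duplicate encodings straight away. It is only a sketch under assumptions: `struct code` is a hypothetical table layout (not the one in the program above), and each code is assumed to sit in the low `len` bits of `bits` with no stray higher bits set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a (hypothetical) encoding table: the low `len` bits of
 * `bits` hold the code, most significant bit first. */
struct code
{
  uint32_t bits;
  unsigned len;    /* 1..32 */
};

/* True when code a is a prefix of code b (identical codes count too). */
static bool is_prefix(const struct code *a, const struct code *b)
{
  if (a->len > b->len)
    return false;
  return (b->bits >> (b->len - a->len)) == a->bits;
}

/* True when no code in the table is a prefix of (or equal to) another,
 * which is what a valid Huffman table must satisfy. */
static bool codes_are_prefix_free(const struct code *table, size_t n)
{
  size_t i;
  size_t j;

  for (i = 0; i < n; i++)
  {
    assert(table[i].len >= 1 && table[i].len <= 32);
    for (j = 0; j < n; j++)
      if (i != j && is_prefix(&table[i], &table[j]))
        return false;
  }
  return true;
}
```

The codes 0, 10, 110 and 111 pass this check; a table containing 10 twice, or both 1 and 10, does not.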

How buggy?

My *fixed* Sooperseekrit compression algorithm vs. `gzip` and `bzip2`
Program	Size of archive
original file	100,000,000
`gzip`	36,445,248
`bzip2`	29,008,736
my attempt	35,947,593

Um … yeah.

It's not the first time my hopes for vast fortunes were dashed because of a bug fix.

Sigh.

Wednesday, April 02, 2008

Peer-to-Peer networking was a reality, and can be yet

I do see NAT getting pushed further and further out into the cloud, which can (and does) disconnect people from important places like work and home. At some point the frog is going to roast in the boiling water.

I had a public IPv6 discussion in Australia recently. Click there for the full discussion. Let me reprint parts:

Two recent examples of NAT is BAD:

1) A friend of mine had a video monitoring system on his storefront in San Juan Del Sur. He was behind quadruple NAT—his own, and the wireless provider there (of the 8 or so ISPs there, only one provides real ip addresses). His house, 1km away, had a different provider, different NAT— SIP between the two locations never worked, he's never got a working vpn, and a few other difficulties like that—but the real kicker: One day—he got robbed—the perps stole everything—including some of the video monitoring system—and because he couldn't monitor his site from his house 1km away, he has no idea who it was.

2) I was trying to get universal internet access out to 26 barrios in a 40 mile wide area—so, for example, a teacher in one location could video out to multiple locations—but again, due to the all the service providers involved, doing NAT, proved utterly impossible.

mirror.internode.on.net and IPv6

DHCP, IPv4, home networks, and IPv6 … with DNS

I agree with Mike here that NAT is eeeeeeeeeeeeeeevil and has damaged the Internet (not as much as Microsoft has done, granted) to the point where it's barely a peer-to-peer network.

I do remember a time, back in the 90s, when every computer on the Internet was a true peer of every other computer on the Internet. I wanted to communicate with someone? My communications went from me, to my computer, to their computer, to them. There was no third party like AOL or GMail arbitrating our conversation (I remember at the time, the IRM department at FAU wanted to control all email and at an interdepartmental meeting, about half the departments said “Hell no!”).

Gone are the days when I had a block of public IP addresses for my home network (once in the 90s, and once just a few years ago). Now, I have to decide which computer gets ssh access from outside, and which gets HTTP access.

IPv6 looks to be a solution, bringing back true peer-to-peer communications, and the work Mike is doing is inspiring me to play around with IPv6 more than I have (which isn't all that much).

Well, that, and free porn …

Friday, April 04, 2008

“Bacon! Is there nothing it can't do?”

I'm sitting at my desk luxuriating in the olfactory bliss of Bacon Salt. Three small bags worth actually, original, hickory, and peppered, ready to be taken to Casa New Jersey to be consumed by an appreciative audience.

Monday, April 28, 2008

Funny how I haven't seen this anti-spam technique bandied about

The past month has been a continual fight with email. It got to the point where I sat down and designed an entirely new email system in the hopes that it would stop spam once and for all (based upon some ideas from Dan Bernstein), and since then I've been mulling it over, trying to find flaws in it.

And find I did.

The system involves three players, the mail client (MC—an MUA in SMTP talk), an outgoing mail server (OMS) and an incoming mail server (IMS—under SMTP, there's a single server, called an MTA, that handles both). There's one protocol between an MC and an OMS, another protocol between an MC and an IMS, and two protocols for communication between an OMS and an IMS. There are also restrictions about what can talk to what; an OMS can only talk to the designated IMS for a domain (much like sending email to an MX record). Conversely, an IMS can only accept connections from a designated OMS from the sender domain (much like what SPF tries to do). An MC needs to be authenticated to the IMS/OMS (much like you need authentication for POP and IMAP to receive email, and some sites now require SMTP AUTH for outgoing email).

Yes, I'm glossing over a lot of details here, but that's the overview. ISPs would still filter mail client traffic, much like they do now. The enforcement of the sending server would pretty much stop joe jobs, and using a notification scheme with the mail spooled on the sender side would eliminate most, if not all, bounce messages.

So far, it seemed like a great scheme.

Until I realized that spammers would then just register tons of domains (or cut deals with domain name registrars to use “just expired but we'll keep it active until we can sell it again” domains) to send spam.

So the only thing I really did was find a way to stop joe jobs; it doesn't really stop spam all that much, and thus, the flaw.

But one remark from Wlofie (I ran the whole system past him a few days ago) leapt out at me—server signatures (one of the optional bits in one of the protocols was a digital signature of the sender; Wlofie suggested I include a digital signature of the server as well). We already have server signatures for websites. And when I realized that, I realized a solution for spam. And one that can be adapted to our current email system.

First, revise the SMTP specification. Remove literal network addressing—that is, the ability to send email to an arbitrary IP address is no longer allowed. If the host portion of an email address does not have an MX record, the email can't be delivered. On the recipient side, make the use of SPF records mandatory, and they must be checked. Also, revise the SPF specification to remove the “SoftFail” and “Neutral” results.

This last step is the controversial one (as if the others weren't already)—SMTP servers must have a signed secure certificate and the protocol must be run over an encrypted channel, similar to how HTTPS works. And if either side has an expired or revoked certificate, the other side must refuse email.
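The receiving side of those rules boils down to a small decision function. What follows is a sketch under assumptions, not part of any real mail server: the type and function names are hypothetical, only the receiver's view is modeled, and treating a self-signed certificate or a missing SPF record as grounds for refusal is my reading of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* SPF outcomes in this simplified model; "SoftFail" and "Neutral" are
 * deliberately absent, per the proposal. */
enum spf_result  { SPF_PASS, SPF_FAIL, SPF_NONE };

/* Hypothetical certificate states for the sending server. */
enum cert_status { CERT_VALID, CERT_EXPIRED, CERT_REVOKED, CERT_SELF_SIGNED };

/* What the receiving server knows about the connecting server. */
struct smtp_peer
{
  bool             encrypted;  /* session runs over an encrypted channel */
  enum cert_status cert;       /* state of the sender's certificate      */
  enum spf_result  spf;        /* result of the mandatory SPF check      */
};

/* Accept mail only when every requirement holds; anything else is refused. */
static bool accept_mail(const struct smtp_peer *peer)
{
  assert(peer != NULL);

  if (!peer->encrypted)
    return false;
  if (peer->cert != CERT_VALID)
    return false;
  return peer->spf == SPF_PASS;
}
```

There is no "maybe" branch here, and that is the point of dropping “SoftFail” and “Neutral”: a sender either passes every check or gets refused.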

What does this gain us?

Accountability.

Getting spam from a few hundred domains? Find out who sold them the signed certificate and send the complaint there. After a few hundred (thousand? hundred thousand?) complaints, the easiest way to handle the situation is to revoke the signature. Sure, the spammer can try bribing the certificate authority, but that's exactly what's missing from today's anti-spam techniques—hitting the spammer where it hurts! And if the spammer tries to use a self-signed certificate? Who would trust it?

Sure, it's an expense to get a signed certificate, but in today's reality, you are either an individual using someone else's server for email (Gmail, Yahoo, your ISP) or you're a business and can afford it (as part of your hosting bill, or just outright, but hey, it's a business expense and can be written off).

I must have forgotten my cardboard programmer that day …

D'oh!

Smirk called today, saying a customer had a problem sending mail with one of their PHP scripts. The server in question was running my PHP/sendmail wrapper and the testing that Smirk did showed that the PHP mail() function wasn't returning anything! Funny, for a function that supposedly returns a bool …

With Wlofie playing the part of cardboard programmer, I did some testing, found that indeed, there was a problem—at first, it looked like the system was terminating the program with random signals. One time it would terminate with a SIGXFSZ, then with SIGTTIN, then with SIGWINCH!

I then stared at the code until I bled …

else /* parent process */
{
  pid_t cstat;
  int   status;

  cstat = waitpid(child,&status,0);

  if (WIFEXITED(cstat))
    rc = WEXITSTATUS(cstat);
  else
    rc = EXIT_FAILURE;

  unlink(tmpfile); /* make sure we clean up after ourselves */
  exit(rc);
}

It was then I saw my mistake—I was checking the wrong variable!

Sigh.

The type of mistake a statically typed language should catch. And before you say “unit testing” my tests were basically “did the email go through? Yup? Then it works”—the thought to check the return code of my program as a whole didn't occur to me (hey, the email got through, right? that meant that it worked).

I changed WIFEXITED(cstat) to WIFEXITED(status) and WEXITSTATUS(cstat) to WEXITSTATUS(status) and it worked.
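For comparison, here is a minimal sketch of the pattern with the right variable checked. The helper names `wait_for_child` and `run_child_exiting_with` are made up for illustration and are not the wrapper's actual code; the key point is that waitpid() returns a process ID, while the termination details land in the int it fills in.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: wait for `child` and turn its termination status
 * into an exit code for the parent. */
static int wait_for_child(pid_t child)
{
  int   status;
  pid_t pid;

  assert(child > 0);
  pid = waitpid(child,&status,0);

  if (pid == (pid_t)-1)        /* waitpid() itself failed */
    return EXIT_FAILURE;

  if (WIFEXITED(status))       /* inspect status, not the returned pid */
    return WEXITSTATUS(status);

  return EXIT_FAILURE;         /* killed by a signal, or similar */
}

/* Hypothetical demo: fork a child that exits with `code` and report
 * what the parent sees. */
static int run_child_exiting_with(int code)
{
  pid_t child = fork();

  if (child == (pid_t)-1)
    return EXIT_FAILURE;
  if (child == 0)
    _exit(code);               /* child: exit immediately */
  return wait_for_child(child);
}
```

Calling run_child_exiting_with(3) hands back 3 in the parent. Feeding the returned process ID to the macros instead decodes what is effectively an arbitrary number, which would explain the random-looking signals.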

I also checked PHP, and for the life of me, I can't figure out why it was returning undef, but then again, PHP is the scripting language du jour so it may be I didn't check the precise version we're running.

Chocolate the old fashioned way

For Bunny, who loves chocolate—making your own from scratch (link via Flutterby).

Wednesday, April 30, 2008

Now that I think about it, I doubt spam will ever go away …

I was talking with G, our Cisco consultant, about a networking issue I had when the talk turned to spam (as that's the root cause of all our email problems at The Company, which currently comprise half the issues in our trouble ticket system). One of the surprising bits of information G conveyed was that spammers are now using /16 network blocks for spamming.

A /16 network block is 65,536 consecutive IP addresses and is rather hard to come by (the smallest block ARIN will hand out is a /20, which gives 4,096 consecutive IPs, and even then, you have to pay quite a bit and justify the usage of said block). The spammer then runs through the /16 (probably using 256 addresses, or a /24, the smallest routable block [1], at a time) until it's been blocked, and then sells it, or returns it; the spammer then obtains another /16 to abuse.
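The block sizes fall straight out of the prefix length: an IPv4 /n leaves 32 − n bits for hosts, so it covers 2^(32 − n) addresses. A tiny sketch of that arithmetic (`cidr_block_size` is a name made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Number of IPv4 addresses in a block with the given prefix length.
 * Returns a 64-bit value so that a /0 (2^32 addresses) still fits. */
static uint64_t cidr_block_size(unsigned prefix_len)
{
  assert(prefix_len <= 32);
  return (uint64_t)1 << (32 - prefix_len);
}
```

That gives 65,536 addresses for a /16, 4,096 for a /20, 256 for a /24 and 4 for a /30, so a single /16 can be carved into 256 separate /24s.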

G also talked about other companies providing anti-spam techniques (like Barracuda) and the more he talked, the more I realized that spam is never going away—there's too much money to be made, by both sides.

Hypothetically speaking, if a spammer approached us and offered tons of money to use some of our IPs (we actually have a few /20s) to spam, if the money was good enough … well … the money would have to be mad money … and even then … well … everybody has their price. Anyway (this is a hypothetical situation) we made money facilitating spamming. The spammer obviously makes money somehow or he wouldn't be doing this. The anti-spam companies like Barracuda make money by offering spam filtering services that need to be continuously updated.

So, what incentive is there for the commercial anti-spam companies to see spam eliminated completely? (as in, spammers never spam anymore)

Hmmm … sounds like anti-virus companies (not that I'm saying that anti-spam companies spam, but that their incentives are centered around staying in business, and totally eliminating spammers is not conducive to their remaining in business).

I only hope this is my more cynical side talking here, and not reality.

[1] Technically, the smallest routable block consists of four consecutive IP addresses (a /30). What I mean by “smallest routable block” in this context are routes that are accepted by the backbone routers.

Stone knives and bear skins, Part II

I briefly mentioned “Project Leaflet” before, with respect to separating logic, language and layout of an application (in this case, a PHP web application), possibly with the use of an IDE.

But the problem goes deeper—what if you need alternative versions of the language? Or logic?

In C, this is handled by conditional compilation:

#ifdef MSDOS
  fp = fopen("C:\\temp\\foobar","wb");
#elif defined(VMS)
  fp = fopen("SYS$USERS:[TEMP.FOOBAR]","wb");
#elif defined(UNIX)
  fp = fopen("/tmp/foobar","w");
#else
  fp = fopen("foobar","wb");
#endif
  if (fp == NULL)
  {
#if defined(UNIX)
    fprintf(stderr,"could not open /tmp/foobar\n");
    return(ENOENT);
#elif defined(MSDOS)
    fprintf(stderr,"could not open C:\\TEMP\\FOOBAR\n");
    return(ENOTFOUND);
#elif defined(VMS)
    fprintf(stderr,"could not open SYS$USERS:[TEMP.FOOBAR]\n");
    return(ENOFILE);
#else
    fprintf(stderr,"could not open foobar\n");
    return(EXIT_FAILURE);
#endif
  }

As you can see, this method leaves a lot to be desired, but still, it's much better than what you get with PHP.

One of the design requirements for “Project Leaflet” is that it can use either MySQL or PostgreSQL. I've already gone through the code and abstracted out the database calls on the (okay, laughably incorrect) assumption that the SQL statements themselves won't require changing.

Ha ha.

Now granted, for the most part, the SQL statements are simple enough that either MySQL or PostgreSQL can run them without problem. But there are a few rough spots, like:

$query = "SELECT "
       . "  *, "
       . "  DATE_FORMAT(sent, '%b. %e, %Y at %l:%i%p') as datesent "
       . "FROM pl_emails WHERE id=$id";

PostgreSQL doesn't understand DATE_FORMAT(); no, it wants TO_CHAR(). To make things even more amusing, the format string is completely different:

$query = "SELECT "
       . "  *, "
       . "  TO_CHAR(sent, 'Mon DD YYYY at HH12:MIam') as datesent "
       . "FROM pl_emails WHERE id=$id";

So right now I'm looking at two codebases, separated by a common language. Sure, there are any number of methods to merge the two into a common codebase:

//---------------
// Variant 1
//---------------

	// would this even work, as it requires 
	// the use of $id ... 

$query = $db_view_query['all_by_date'];

//-----------
// Variant 2
//-----------

if ($db === "MySQL")
{
  $query = "SELECT "
       . "  *, "
       . "  DATE_FORMAT(sent, '%b. %e, %Y at %l:%i%p') as datesent "
       . "FROM pl_emails WHERE id=$id";
}
elseif ($db === "PostgreSQL")
{
  $query = "SELECT "
       . "  *, "
       . "  TO_CHAR(sent, 'Mon DD YYYY at HH12:MIam') as datesent "
       . "FROM pl_emails WHERE id=$id";
}
else
{
  // -------------------------------------
  // love the way the language separation 
  // was done ... 
  // -------------------------------------

  die ($lang['a_horrible_death']);
}

//----------------
// Variant 3
//----------------

$query = "SELECT "
       . "  *, "
       . $dbdatefunct . "(sent,'$dbdateformat') as datesent "
       . "FROM pl_emails WHERE id=$id";

Each solution being worse than the previous one. At least C has the decisions being done at compile time; I'm stuck with runtime decisions, or with very gross self-modifying code (variant #3—yes, that's what that is, self-modifying code).

As it stands right now, I have two branches of the code, a MySQL version and a PostgreSQL version, and I'm wavering between keeping them separate or merging the two, and the “keep them separate” faction is winning. That's because I'm currently using git, which makes branching a no-brainer (no, truly—switching between branches is trivial and takes no time at all; yes, it's a bit clunky trying to keep a central repository using git, but the branching is worth the clunkiness). And git's merging capabilities means that propagating fixes between the branches is easy as well (for fixes that apply across all branches, obviously). git comes very close to the fine-grained revision control I talked about.

So, not only do I want fine-grained revision control, but a way to say “these changes I'm making apply to all the branches, and these changes only to this branch over here.”
