Processing realtime data from LaBrea

Saturday, January 14, 2006

Since I'm stuck at The Hospital, and The Hospital doesn't have wireless, I might as well work on the real-time LaBrea data processing program. Loaded up the laptop with program fragments and libraries I may need and brought along my copy of Advanced Programming in the Unix® Environment by W. Richard Stevens, and am coding up a storm.

And yes, I'm coding this up in C. Not like I'm going anywhere in the next couple of days.

LaBrea spits out lines like:

1136832897 Initial Connect - tarpitting: 82.240.204.251 3334 -> XXX.XXX.XXX.XXX *
1136832897 Initial Connect - tarpitting: 82.240.204.251 3339 -> XXX.XXX.XXX.XXX 139
1136832897 Persist Trapping: 82.240.204.251 3334 -> XXX.XXX.XXX.XXX 445 *
1136832898 Persist Activity: 216.248.36.242 45285 -> XXX.XXX.XXX.XXX 135
1136832898 Persist Activity: 216.248.36.242 34589 -> XXX.XXX.XXX.XXX 135 *
1136832898 Persist Activity: 216.211.61.158 3862 -> XXX.XXX.XXX.XXX 135
1136832898 Persist Activity: 211.236.205.138 4459 -> XXX.XXX.XXX.XXX 139 *

There's some other stuff that's logged, but I'm primarily looking for lines like the above. Which means the information I can store will look something like:

struct tprecord
{
  IP     src;
  Port   sport;
  IP     dest;
  Port   dport;
  time_t start; /* first time we see src:sport-dest:dport */
  time_t last;  /* time of last activity of src:sport-dest:dport */
  size_t packets;
};

Now, since we'll be handling a lot of connections, we need to store these structures in a way that is easy to search through. Could use a hash table, and we are keying off the src:sport→dest:dport tuple, and that's 10 bytes of pretty much random binary data, which is good for this type of thing. But it's an open ended hash, not knowing how many entries we'll be handling. There are several methods to handle overflows in a hash table, but that still means at some point doing a linear search, and then there's the overhead of having a hash table.

There's also storing the data in a tree. In fact, I even have code in C to manage a balanced tree (taken from Knuth). But the downside is—I only have code to add to the tree, not to delete from it. And it was hard enough to write as is.

And there's still the overhead of maintaining a tree structure.

So I'm using arrays. Less overhead and straightforward to manage. The only trick is keeping the array sorted in some order, but as long as you compare the structures in a consistent manner, that's not really an issue (and by the way, I'm sorting by source address, port, destination address and port, in that order).

To make it even easier, I have an array of structures (see above), and an array of pointers to said structures, and it's the pointer based array that is sorted (since it's faster to swap pointers than swap whole structures). And once sorted, you can use a binary search to find the record.

The default binary search in C, bsearch(), returns either a pointer to the found record, or NULL if not found. Good for most uses, but not quite what I want. In addition to either finding the record or not, I'd also like to know where in the array the record was found, and if not, where it would appear if it was in the array. And C's bsearch() does not return the index.

Why do I want that?

My thinking is that if the search (reguardless of what type of search it is) didn't find the record in a sorted array, it has the information as to where the missing record should be. And with the information as to where, it would be “trivial” to add the record at the right point in the array.

And that beats adding the new record at the end, and restorting (C's qsort() is Quick Sort, which on average is O(n lg n) running time, but the worse case is O(n²) and the worst case is then the array is mostly sorted to begin with, which is why I'd rather not resort after each addition).

And try as I might, I could not get a binary search to work that would also return the index. For now, I replaced the search with a linear search, until I can comb through some reference materials at home.

The other problem I'm having is the array manipulation code doesn't quite work. When the program starts, it has space to store X records. If it runs out of space, it increases the size of the various arrays until it hits some large number of records, then it will go through, purge records fitting some criteria.

Only it's not working properly.

Got a lot of work done on the program though, and it was nice to get into The Zone (even with a headache). At this point, I'm headed home for sleep and what looks to be yet another day at The Hospital.

The Boston Diaries

Saturday, January 14, 2006

Processing realtime data from LaBrea

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer