Saturday, January 14, 2006
Processing realtime data from LaBrea
Since I'm stuck at The Hospital, and The Hospital doesn't have wireless, I might as well work on the real-time LaBrea data processing program. Loaded up the laptop with program fragments and libraries I may need and brought along my copy of Advanced Programming in the Unix® Environment by W. Richard Stevens, and am coding up a storm.
And yes, I'm coding this up in C. Not like I'm going anywhere in the next couple of days.
LaBrea spits out lines like:
1136832897 Initial Connect - tarpitting: 82.240.204.251 3334 -> XXX.XXX.XXX.XXX * 1136832897 Initial Connect - tarpitting: 82.240.204.251 3339 -> XXX.XXX.XXX.XXX 139 1136832897 Persist Trapping: 82.240.204.251 3334 -> XXX.XXX.XXX.XXX 445 * 1136832898 Persist Activity: 216.248.36.242 45285 -> XXX.XXX.XXX.XXX 135 1136832898 Persist Activity: 216.248.36.242 34589 -> XXX.XXX.XXX.XXX 135 * 1136832898 Persist Activity: 216.211.61.158 3862 -> XXX.XXX.XXX.XXX 135 1136832898 Persist Activity: 211.236.205.138 4459 -> XXX.XXX.XXX.XXX 139 *
There's some other stuff that's logged, but I'm primarily looking for lines like the above. Which means the information I can store will look something like:
struct tprecord { IP src; Port sport; IP dest; Port dport; time_t start; /* first time we see src:sport-dest:dport */ time_t last; /* time of last activity of src:sport-dest:dport */ size_t packets; };
Now, since we'll be handling a lot of connections, we need to store these
structures in a way that is easy to search through. Could use a hash table,
and we are keying off the src:sport→dest:dport
tuple, and
that's 10 bytes of pretty much random binary data, which is good for this
type of thing. But it's an open ended hash, not knowing how many entries
we'll be handling. There are several methods to handle overflows in a hash
table, but that still means at some point doing a linear search, and then
there's the overhead of having a hash table.
There's also storing the data in a tree. In fact, I even have code in C to manage a balanced tree (taken from Knuth). But the downside is—I only have code to add to the tree, not to delete from it. And it was hard enough to write as is.
And there's still the overhead of maintaining a tree structure.
So I'm using arrays. Less overhead and straightforward to manage. The only trick is keeping the array sorted in some order, but as long as you compare the structures in a consistent manner, that's not really an issue (and by the way, I'm sorting by source address, port, destination address and port, in that order).
To make it even easier, I have an array of structures (see above), and an array of pointers to said structures, and it's the pointer based array that is sorted (since it's faster to swap pointers than swap whole structures). And once sorted, you can use a binary search to find the record.
The default binary search in C, bsearch()
, returns either a
pointer to the found record, or NULL
if not found. Good for
most uses, but not quite what I want. In addition to either finding the
record or not, I'd also like to know where in the array the record
was found, and if not, where it would appear if it was in the
array. And C's bsearch()
does not return the index.
Why do I want that?
My thinking is that if the search (reguardless of what type of search it is) didn't find the record in a sorted array, it has the information as to where the missing record should be. And with the information as to where, it would be “trivial” to add the record at the right point in the array.
And that beats adding the new record at the end, and restorting (C's
qsort()
is Quick Sort, which on average is O(n lg
n)
running time, but the worse case is O(n2)
and the worst case is then the array is mostly sorted to begin with, which
is why I'd rather not resort after each addition).
And try as I might, I could not get a binary search to work that would also return the index. For now, I replaced the search with a linear search, until I can comb through some reference materials at home.
The other problem I'm having is the array manipulation code doesn't quite work. When the program starts, it has space to store X records. If it runs out of space, it increases the size of the various arrays until it hits some large number of records, then it will go through, purge records fitting some criteria.
Only it's not working properly.
Got a lot of work done on the program though, and it was nice to get into The Zone (even with a headache). At this point, I'm headed home for sleep and what looks to be yet another day at The Hospital.