Monday, April 11, 2005
Updating servers
There's an interesting project lumbering down the pike (I do hope we get it), but it involves a large number of machines acting as one—I think the marketing term would be “enterprise cluster.” One of the interesting challenges involves updating each of the servers (did I mention there are a large number of them?) with content. There will be one master control processor (the MCP) and N slaves. If the MCP copies the files out to each slave in turn, the copy takes O(N) time as an upper bound. However, if in addition to the MCP copying files, each slave starts copying to slaves that don't yet have the file as soon as it receives its own copy, then the number of machines holding the file doubles each round and the time drops to O(log N), which is much better.
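A quick sketch of the arithmetic (Python, purely illustrative; the function and the node counts are mine, not part of any actual distribution script):

import math

def rounds_to_distribute(n_nodes):
    """Count copy rounds until all n_nodes have the file,
    assuming every node that has it sends one copy per round."""
    have = 1          # the MCP starts with the file
    rounds = 0
    while have < n_nodes:
        have = min(n_nodes, have * 2)   # holders double each round
        rounds += 1
    return rounds

for n in (4, 16, 100, 1000):
    print(n, "nodes:", rounds_to_distribute(n), "rounds")
# prints 2, 4, 7 and 10 rounds respectively: ceil(log2(N)) growth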
But is there a faster way?
Ideally, the files will be copied to each node, but is there a way to broadcast the file …
Broadcast.
IP has the concept of broadcasting.
Unfortunately, while IP can send broadcast packets, the reliable byte-oriented TCP protocol (built on top of IP) doesn't. TCP is meant to be a reliable data stream between two processes—a “one-to-one” type of communication. No, a broadcast copy will require the use of UDP, a connectionless protocol that is, unfortunately, unreliable (well, so is IP, but TCP includes mechanisms for reliability). But if it can be made to work, the time to copy a file to N nodes drops from O(log N) to O(1)—as fast as you can get.
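The mechanics are simple enough to sketch. Here's the bare-bones version in Python (the group address and port are made up, and note what's missing: any acknowledgement or retransmission, which is precisely the hard part a real tool has to supply):

import socket, struct

GROUP, PORT = "224.1.1.1", 5007   # invented multicast group and port

# receiver: join the multicast group and wait for datagrams
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv_sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
recv_sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# sender: one datagram goes out, and every listener can receive it
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
send_sock.sendto(b"one chunk of the file", (GROUP, PORT))

data, addr = recv_sock.recvfrom(65536)  # may never arrive: UDP is unreliable
print("got", len(data), "bytes from", addr)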
And it seems like I'm not the only one to latch onto this idea.
[root unix]# ./mdp
mdp: Error! Must pick "client" mode (-D <cache/archive directory>),
"server mode" (i.e. supply file(s) to transmit), or both.
Use "-h" option for usage help.
mdp: error starting mdp
[root unix]# ./mdp -h
CmdMdpApp::ProcessCommands() invalid flag
mdp: error processing command-line options
[root unix]#
(Sigh.)
Aside from programs that don't understand their own options, another aspect we're facing is administering a large number of machines (and not the few dozen we have now). Towards that end, I've been reading all the articles at Infrastructures.org—a type of “best practices” for administering a large number of systems.
The enterprise cluster concept simplifies how we maintain individual hosts. Upon adopting this mindset, it immediately becomes clear that all nodes in an enterprise cluster infrastructure need to be generic, each providing a commodity resource to the infrastructure. It becomes a relatively simple operation to add, delete, or replace any node.
The one bit I do have problems with is the “Push vs. Pull” argument:
We swear by a pull methodology for maintaining infrastructures, using a tool like SUP, CVSup, an rsync server, or cfengine. Rather than push changes out to clients, each individual client machine needs to be responsible for polling the gold server at boot, and periodically afterwards, to maintain its own rev level.
I'm not a fan of polling—the majority of polls won't return any new information and are just extra work on the part of the poller (and in this case, extra network traffic). Also unspecified is how to handle the spike in traffic when there is an update. I'm sure in practice each machine is set to poll the “gold server” at semi-random intervals, lest it suffer a network interface meltdown when hundreds—nay, thousands—of clients hit it at once for an update.
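Presumably each client runs something like this (a sketch only; the interval numbers, host name, and rsync module are all invented for illustration):

import random, subprocess, time

BASE_INTERVAL = 30 * 60   # poll every 30 minutes (made-up number)
JITTER = 10 * 60          # plus up to 10 minutes of random splay

while True:
    # the random splay spreads the clients out so they don't all
    # hammer the gold server at the same instant
    time.sleep(BASE_INTERVAL + random.uniform(0, JITTER))
    subprocess.run(["rsync", "-a", "goldserver::updates/", "/usr/local/"])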
Their arguments about overly convoluted distribution scripts are compelling too. But in looking at broadcast copying (or multicast copying, take your pick), I'm wondering if a better approach than strict polling would be to periodically broadcast that an update is ready, and have each server that receives the broadcast then contact the “gold server” for updates (and I'm talking about system updates, not just content being sent out).
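Roughly what I have in mind (again, a sketch; the group address, port, and rsync target are inventions): the gold server multicasts a tiny “update ready” datagram, and each listener reacts by pulling over a reliable unicast channel.

import socket, struct, subprocess

GROUP, PORT = "224.1.1.2", 5008   # invented group and port

# runs on every slave: block until an announcement arrives, then
# pull the actual bits over a reliable channel (rsync over TCP)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(1024)   # the "update ready" beacon
    if data == b"UPDATE":
        subprocess.run(["rsync", "-a", "goldserver::updates/", "/usr/local/"])

The nice property here is that a lost announcement only delays a pull until the next beacon; the file transfer itself stays reliable.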
Much to think about …
When multicast is slower than unicast
So I'm playing around with MDP and I am underwhelmed by its performance. I have it installed on five machines. The file I'm transferring is about 4.5MB in size and can be transferred in under 30 seconds using scp (and that's including the time it takes for me to type in a password), so transferring the file serially to four machines (the fifth is doing the sending) should take around two minutes.
I stopped the first test after twelve minutes. Okay, one of the machines is a bit flaky, so I dropped it from the test and ran the test again.
Ten minutes.
Also, because of the way multicast works, the sending process doesn't exactly end once the clients have the data, just in case another process joins to receive the broadcast (I'm guessing here; the documentation for MDP is scant and the help feature doesn't work), so this may not work for what we want to do.