Sunday, June 23, 2024
How does TLS use less CPU than plain TCP, part II
I noticed over the past couple of days that the CPU utilization was similar between the two services, but looking at the logs it seemed my gopher server got hit by several bots. So last night, just as the logs reset (a weekly job), I decided to restart the two services so as to match CPU utilization and requests. Twelve hours later and …
 | CPU | requests |
---|---|---|
gopher | 0:28 | 175 |
gemini | 0:26 | 1744 |
A week and a half after declaring “Mission Complete,” and apparently it's not. Even worse, my gopher server is using a bit more CPU than the Gemini server while getting one-tenth the number of requests. Something is off here.
I checked the requests, and over 90% of the requests to both servers are for my blog, so I'm confident the issue lies somewhere in there. Both use the same request framework, both use the same code to read blog entries, both use the same code to parse the HTML, so it's not that. The only difference left is the formatting of each entry. The gopher server does wrap the text to 80 columns (not needed for the Gemini server), and there's quite a bit more code to deal with fixing the links … are those the causes? It's hard to determine what's causing the excessive CPU utilization.
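For a sense of what the 80-column wrapping involves (the server's actual wrapping code is in Lua; `strcore_wrapt` shows up in the profile further down), here's a minimal greedy word wrap sketched in Python, not the server's real implementation:

```python
def wrap(text, width=80):
    """Greedy word wrap: pack words onto lines no longer than `width`."""
    lines, current = [], ""
    for word in text.split():
        if not current:
            current = word                       # first word on the line
        elif len(current) + 1 + len(word) <= width:
            current += " " + word                # word fits, append it
        else:
            lines.append(current)                # line is full, start a new one
            current = word
    if current:
        lines.append(current)
    return lines
```

Even this simple greedy approach touches every word of every entry served, so its cost scales with the total amount of text formatted, which is why wrapping was a plausible suspect.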
I'm going to have to think about how to determine the issue.
Heck, I'm going to have to think if this is even worth determining. I mean, the CPU utilization is better, but it's still puzzling.
The case of the weird profiling results
Okay, I'm flummoxed.
After a bit of thought, I had an idea on how to determine where the CPU time is going on my gopher server—I have at least two working hypotheses that I can test: wrapping text and translating links. So, to test these, I will get a list of requests since the logs rotated earlier today, and then run them through the gopher server (a second one set up on my public server just for these tests) to get a baseline; then remove the link translations and run the requests again; then remove wrapping the text and run the requests a third time and see what I get.
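The post doesn't say how the logged requests were replayed; since gopher requests are just a selector followed by CRLF over TCP (with the server closing the connection after responding), a replay harness is only a few lines. A sketch in Python, with hypothetical names:

```python
import socket

def replay(host, port, selectors):
    """Replay logged gopher selectors against a test server.

    A gopher request is the selector plus CRLF; the server sends the
    response and closes the connection, so read until EOF.
    """
    responses = []
    for selector in selectors:
        with socket.create_connection((host, port)) as s:
            s.sendall(selector.encode() + b"\r\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:          # server closed the connection
                    break
                chunks.append(data)
            responses.append(b"".join(chunks))
    return responses
```

Feeding the same selectors to the test instance and comparing its accumulated CPU time against the production instance is the whole experiment.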
For the first test, I expect the test gopher server to end up with about the same CPU utilization as the one that normally runs. It makes sense, right? The normal one has x CPU utilization with y requests, so if I feed the test gopher server the same requests (even though all the requests come from the same location) it should have the same CPU utilization. Or at least close enough to show that I'm on the right track.
When I pulled the requests out of the log files and checked the current gopher server, it had received 332 requests and racked up 0:47 in CPU time. I set up the test gopher server (the only configuration changes being a different log file and a different port, which meant I didn't have to run it as root). I made the 332 requests to my test gopher server and got a CPU time of 0:22.
What?
I reran this test a few times and got the same results each time—0:22.
Which is less than half the time of the normal gopher server with the same number of requests. If it were one or two seconds off, hey, close enough. But half?
It's not making sense.
But I decided to continue and run the other tests. First, I removed the code that does the link translation, reran the requests, and got a CPU time of 0:20.
Okay, that tells me the link translations do take time, but I don't think it's enough to explain the CPU utilization of the gopher server. Maybe. On a whim, I decided to change the third test to not even bother with processing blog entries—I modified the blog entry handler to just return. Given that ¾ of the requests are to the blog, this should make it run much faster and use less CPU time.
I got a CPU time of 0:18.
I don't know what to think. I would have expected this to be 0:05 or 0:06, given that 75% of the requests would not be generated. Something weird is going on.
It was time to take a step back. I went back to the original benchmark, but instead of counting lines executed, I decided to count Lua VM instructions for a decently sized request.
 | Lua VM instructions |
---|---|
gopher | 61622 |
gemini | 67401 |
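The post doesn't show how the instructions were counted; in Lua this is typically done with a count hook installed via `debug.sethook`. Here's a rough Python analog of the same idea, counting bytecode operations with a trace function (a sketch, not the actual Lua instrumentation):

```python
import sys

def count_opcodes(fn, *args):
    """Run fn(*args) and count VM instructions (bytecode ops) executed,
    roughly analogous to a Lua count hook set with debug.sethook."""
    counter = {"ops": 0}

    def tracer(frame, event, arg):
        frame.f_trace_opcodes = True     # request per-opcode trace events
        if event == "opcode":
            counter["ops"] += 1
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)               # always remove the hook
    return result, counter["ops"]
```

One caveat with this kind of instrumentation: the hook dispatch itself costs time, which may be why `luaD_hook` and `luaG_traceexec` rank so high, with call counts near a million, in the C-level profiles.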
Okay, the gopher server is clearly executing fewer Lua VM instructions than the Gemini server. Could I gain any insight from profiling at the C level? I had already done most of the work to profile both the gopher and Gemini servers. To make sure I got enough data, I ran the decently sized request three times for each server.
The gopher server:

% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
13.79 | 0.04 | 0.04 | 185781 | 0.00 | 0.00 | luaV_execute |
10.34 | 0.07 | 0.03 | 734588 | 0.00 | 0.00 | index2value |
6.90 | 0.09 | 0.02 | 398225 | 0.00 | 0.00 | luaS_new |
6.90 | 0.11 | 0.02 | 45028 | 0.00 | 0.00 | luaH_newkey |
3.45 | 0.12 | 0.01 | 1041939 | 0.00 | 0.00 | yymatchChar |
3.45 | 0.13 | 0.01 | 924394 | 0.00 | 0.00 | luaG_traceexec |
3.45 | 0.14 | 0.01 | 503889 | 0.00 | 0.00 | yyText |
3.45 | 0.15 | 0.01 | 260252 | 0.00 | 0.00 | luaD_precall |
3.45 | 0.16 | 0.01 | 245893 | 0.00 | 0.00 | mainpositionTV |
3.45 | 0.17 | 0.01 | 201753 | 0.00 | 0.00 | auxgetstr |
3.45 | 0.18 | 0.01 | 191931 | 0.00 | 0.00 | yy_S |
3.45 | 0.19 | 0.01 | 185373 | 0.00 | 0.00 | equalkey |
3.45 | 0.20 | 0.01 | 134088 | 0.00 | 0.00 | yyDo |
3.45 | 0.21 | 0.01 | 129432 | 0.00 | 0.00 | yy_CHAR |
3.45 | 0.22 | 0.01 | 101937 | 0.00 | 0.00 | reverse |
3.45 | 0.23 | 0.01 | 34759 | 0.00 | 0.00 | luaH_getn |
3.45 | 0.24 | 0.01 | 4473 | 0.00 | 0.00 | getfirst |
3.45 | 0.25 | 0.01 | 1625 | 0.01 | 0.01 | traverseproto |
3.45 | 0.26 | 0.01 | 834 | 0.01 | 0.01 | strcore_wrapt |
3.45 | 0.27 | 0.01 | 61 | 0.16 | 0.16 | checkloops |
3.45 | 0.28 | 0.01 | 2 | 5.00 | 5.00 | deletelist |
3.45 | 0.29 | 0.01 | | | | cclasscmp |
0.00 | 0.29 | 0.00 | 924377 | 0.00 | 0.00 | luaD_hook |
0.00 | 0.29 | 0.00 | 391575 | 0.00 | 0.00 | yymatchString |
0.00 | 0.29 | 0.00 | 358374 | 0.00 | 0.00 | luaH_getshortstr |
0.00 | 0.29 | 0.00 | 261889 | 0.00 | 0.00 | prepCallInfo |
0.00 | 0.29 | 0.00 | 261296 | 0.00 | 0.00 | luaD_poscall |
The Gemini server:

% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
9.38 | 0.03 | 0.03 | 1011065 | 0.00 | 0.00 | luaG_traceexec |
9.38 | 0.06 | 0.03 | 1011056 | 0.00 | 0.00 | luaD_hook |
7.81 | 0.09 | 0.03 | 204707 | 0.00 | 0.00 | luaV_execute |
6.25 | 0.10 | 0.02 | 443861 | 0.00 | 0.00 | luaS_new |
6.25 | 0.12 | 0.02 | 396470 | 0.00 | 0.00 | luaH_getshortstr |
3.12 | 0.14 | 0.01 | 668980 | 0.00 | 0.00 | index2value |
3.12 | 0.14 | 0.01 | 391575 | 0.00 | 0.00 | yymatchString |
3.12 | 0.15 | 0.01 | 271008 | 0.00 | 0.00 | mainpositionTV |
3.12 | 0.17 | 0.01 | 243043 | 0.00 | 0.00 | luaD_precall |
3.12 | 0.17 | 0.01 | 242834 | 0.00 | 0.00 | moveresults |
3.12 | 0.18 | 0.01 | 217137 | 0.00 | 0.00 | ccall |
3.12 | 0.20 | 0.01 | 202203 | 0.00 | 0.00 | hookf |
3.12 | 0.20 | 0.01 | 129432 | 0.00 | 0.00 | yy_CHAR |
3.12 | 0.21 | 0.01 | 56698 | 0.00 | 0.00 | llex |
3.12 | 0.23 | 0.01 | 39543 | 0.00 | 0.00 | internshrstr |
3.12 | 0.23 | 0.01 | 30301 | 0.00 | 0.00 | luaM_malloc_ |
3.12 | 0.24 | 0.01 | 23821 | 0.00 | 0.00 | luaH_realasize |
3.12 | 0.26 | 0.01 | 5906 | 0.00 | 0.00 | luaV_concat |
3.12 | 0.27 | 0.01 | 4149 | 0.00 | 0.00 | GCTM |
3.12 | 0.28 | 0.01 | 1872 | 0.01 | 0.02 | yy_PCDATA |
3.12 | 0.28 | 0.01 | 928 | 0.01 | 0.01 | match |
3.12 | 0.29 | 0.01 | 787 | 0.01 | 0.01 | luaF_newLclosure |
3.12 | 0.30 | 0.01 | 595 | 0.02 | 0.02 | luaK_int |
3.12 | 0.32 | 0.01 | 59 | 0.17 | 0.17 | l_strcmp |
1.56 | 0.32 | 0.01 | 12 | 0.42 | 0.42 | luaV_finishOp |
0.00 | 0.32 | 0.00 | 1041939 | 0.00 | 0.00 | yymatchChar |
This is not easy to interpret. As expected, the Lua VM shows up in the top spots for both, but nothing really stands out. It is nice to see that yymatchChar and yymatchString (both in the HTML parsing module) are called the same number of times (expected), but the times are different. The Lua internal function index2value is called a different number of times, but that might be due to code differences.

I think the higher percentage of time in the gopher server might simply be due to it taking less time overall. For instance, in the gopher server, deletelist has a self ms/call of 5, but on the Gemini server it has a self ms/call of 0.52. Is the garbage collector being called more often in the gopher server? Or is it that, with less time spent overall, each call shows up with a higher per-call time? It's only called twice in both codebases. I'm not sure how to interpret this.
Somehow, the gopher server is doing less work than the Gemini server, yet accumulating more CPU time, despite receiving only about a tenth as many requests. I don't understand how that can be.
I think I'll leave things alone for now. I went several years with a not-optimized gopher server, so I think I can live with this new version for now.