Monday, Debtember 28, 2020
Yet more adventures in profiling
While I'm now satisfied with the memory usage, I've started watching the CPU utilization, and noticed with some dismay that it's quite high (even with the new HTML parser being faster overall). I started an instance a day ago (the 27th), and it has already accumulated 27 minutes, 35 seconds of CPU time. By contrast, the web server has only accumulated 37 seconds of CPU time since the 25th.
That's a large difference.
The server in question is written in Lua. I have another server written in Lua, and it has only accumulated 1 minute, 26 seconds since the 25th.
There are two differences that might account for the discrepancy:
- one gets more requests than the other;
- one uses TLS, the other doesn't.
But to be sure, I need a controlled experiment. Since both servers basically do the same thing (mainly, serve up this blog via gopher and Gemini, converting the HTML to text formats, thus the need for an HTML parser), it was easy enough to generate a list of comparable requests for both and profile the execution.
Unfortunately, profiling across shared objects doesn't necessarily work all that well (at least on Linux). I recompiled both Lua and all the Lua modules I use (at least, the ones written in C) with profiling enabled, but the results only showed the time spent in the main Lua VM and nowhere else.
I then spent the time constructing a self-contained executable (containing Lua, plus all the modules comprising the application) of port70 (the gopher server) and another one for GLV-1.12556 (the Gemini server). Pretty easy to do, if a bit tedious in tracking down all the modules to include in the executables. I didn't bother with any optimizations for these runs, as I'm trying to get a feel for where the time is spent.
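Tracking them down is mostly a question of seeing what the application actually loads. Something along these lines gets you most of the list (the require here is just a stand-in for loading the real application):

```lua
-- Rough way to enumerate what needs baking into a self-contained
-- executable: load the application's code, then list everything
-- that ended up in package.loaded.

require "lpeg"   -- stand-in; in practice, require the application itself

local names = {}
for name in pairs(package.loaded) do
  names[#names+1] = name
end
table.sort(names)

for _,name in ipairs(names) do
  print(name)
end
```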
I profiled each executable, making the same 1,171 requests (well, “same” meaning “requesting the same content”) to each program.
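The replay harness itself is nothing special. A minimal sketch of the gopher side, assuming LuaSocket and a plain file of selectors, one per line (neither of which is necessarily what I actually used), looks like this:

```lua
-- Replay a list of gopher selectors, one TCP connection per request
-- (per RFC 1436).  HOST, PORT and "selectors.txt" are placeholders.

local socket = require "socket"  -- LuaSocket

local HOST = "localhost"
local PORT = 70

for selector in io.lines("selectors.txt") do
  local conn = assert(socket.connect(HOST,PORT))
  conn:send(selector .. "\r\n")
  repeat
    local data = conn:receive(8192)  -- drain the reply until the server closes
  until not data
  conn:close()
end
```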
First, port70, the gopher server, straight TCP connection. It accumulated 14 seconds of CPU time with the profile run, and the results:
% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
18.06 | 0.56 | 0.56 | 33881 | 0.00 | 0.00 | luaV_execute |
17.74 | 1.11 | 0.55 | 986744 | 0.00 | 0.00 | match |
4.03 | 1.24 | 0.13 | 28481743 | 0.00 | 0.00 | lua_gettop |
3.55 | 1.35 | 0.11 | 22087107 | 0.00 | 0.00 | index2value |
2.58 | 1.43 | 0.08 | 11321831 | 0.00 | 0.00 | yymatchChar |
2.26 | 1.50 | 0.07 | 6478653 | 0.00 | 0.00 | touserdata |
2.26 | 1.57 | 0.07 | 2063343 | 0.00 | 0.00 | pushcapture |
1.94 | 1.63 | 0.06 | 2074113 | 0.00 | 0.00 | lua_getmetatable |
1.94 | 1.69 | 0.06 | 2068487 | 0.00 | 0.00 | auxgetstr |
1.61 | 1.74 | 0.05 | 2222138 | 0.00 | 0.00 | luaS_new |
1.29 | 1.78 | 0.04 | 5469355 | 0.00 | 0.00 | luaV_equalobj |
1.29 | 1.82 | 0.04 | 5239401 | 0.00 | 0.00 | luaH_getshortstr |
1.29 | 1.86 | 0.04 | 2042852 | 0.00 | 0.00 | luaL_checkudata |
1.29 | 1.90 | 0.04 | 1207086 | 0.00 | 0.00 | lua_tolstring |
1.29 | 1.94 | 0.04 | 1070855 | 0.00 | 0.00 | luaT_gettmbyobj |
1.29 | 1.98 | 0.04 | 175585 | 0.00 | 0.00 | internshrstr |
Nothing terribly surprising there. The function luaV_execute() is expected at the top, as that's the main driver for the Lua VM. match() is from LPEG, which is used for all parsing aside from HTML. The function yymatchChar() is from the HTML parser I wrote, so again, no terrible surprise there.
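Just to illustrate the kind of LPEG use involved: every call to a pattern's match() method lands in that C match() function. A made-up example (this is not the grammar the servers actually run):

```lua
-- Illustrative only: a tiny LPEG pattern in the general style of the
-- request parsers.  Each pattern:match() call ends up in the C
-- function match() near the top of the profile.

local lpeg = require "lpeg"

local selector = lpeg.C((lpeg.P(1) - lpeg.S"\t\r\n")^0) -- capture up to TAB/CR/LF
local crlf     = lpeg.P"\r"^-1 * lpeg.P"\n"
local request  = selector * crlf * lpeg.P(-1)           -- must consume the whole line

print(request:match("some-selector\r\n")) --> some-selector
```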
Now, GLV-1.12556, the Gemini server, using TLS. This accumulated 1 minute, 24 seconds of CPU time with the profile run. The results:
% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
8.06 | 0.10 | 0.10 | 30070 | 0.00 | 0.01 | luaV_execute |
7.26 | 0.19 | 0.09 | 1494750 | 0.00 | 0.00 | luaH_getshortstr |
5.65 | 0.26 | 0.07 | 11943 | 0.01 | 0.01 | match |
4.03 | 0.31 | 0.05 | 535091 | 0.00 | 0.00 | luaD_precall |
4.03 | 0.36 | 0.05 | 502074 | 0.00 | 0.00 | moveresults |
3.23 | 0.40 | 0.04 | 129596 | 0.00 | 0.00 | luaS_hash |
2.42 | 0.43 | 0.03 | 11321831 | 0.00 | 0.00 | yymatchChar |
2.42 | 0.46 | 0.03 | 4218262 | 0.00 | 0.00 | yyText |
2.42 | 0.49 | 0.03 | 3293376 | 0.00 | 0.00 | yymatchString |
2.42 | 0.52 | 0.03 | 1508070 | 0.00 | 0.00 | yyrefill |
2.42 | 0.55 | 0.03 | 377362 | 0.00 | 0.00 | luaH_newkey |
1.61 | 0.57 | 0.02 | 2898350 | 0.00 | 0.00 | index2value |
1.61 | 0.59 | 0.02 | 2531258 | 0.00 | 0.00 | lua_gettop |
1.61 | 0.61 | 0.02 | 1081241 | 0.00 | 0.00 | yy_CHAR |
1.61 | 0.63 | 0.02 | 230982 | 0.00 | 0.00 | luaV_equalobj |
1.61 | 0.65 | 0.02 | 174295 | 0.00 | 0.00 | luaV_finishget |
1.61 | 0.67 | 0.02 | 136553 | 0.00 | 0.00 | luaT_gettmbyobj |
1.61 | 0.69 | 0.02 | 129534 | 0.00 | 0.00 | internshrstr |
1.61 | 0.71 | 0.02 | 10363 | 0.00 | 0.00 | traversestrongtable |
1.61 | 0.73 | 0.02 | 4684 | 0.00 | 0.00 | lua_resume |
It's satisfying to see the same number of calls to yymatchChar(), but the times are smaller overall and there's less of a spike with luaV_execute(), leading me to believe the time is actually being spent in TLS. That isn't showing up in the profile because I haven't compiled the TLS library with profiling enabled, and it's dynamically linked in anyway.
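One way to get a rough sense without touching the library would be to time the TLS handshake separately from the rest of a request, from the Lua side. A sketch, using LuaSec purely for illustration (not necessarily the TLS binding the server uses) and os.clock(), which conveniently measures CPU time rather than wall-clock time:

```lua
-- Split one Gemini request into handshake time vs. total time, as
-- measured in CPU seconds by os.clock().  HOST is a placeholder;
-- LuaSec (the "ssl" module) is assumed only for this sketch.

local socket = require "socket"
local ssl    = require "ssl"

local HOST = "localhost"
local PORT = 1965

local params = { mode = "client", protocol = "tlsv1_2", verify = "none" }

local start = os.clock()
local sock  = assert(socket.connect(HOST,PORT))
local conn  = assert(ssl.wrap(sock,params))
assert(conn:dohandshake())
local handshake = os.clock() - start

conn:send("gemini://" .. HOST .. "/\r\n")
repeat
  local data = conn:receive(8192)  -- drain the response until the server closes
until not data
conn:close()
local total = os.clock() - start

print("handshake CPU",handshake)
print("total CPU    ",total)
```

If the handshake dominates, that points the finger at TLS rather than at the request handling.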
I'm fairly confident that TLS is sucking up the CPU time and that it's not necessarily my code. I'm apprehensive about attempting to recompile the TLS library with profiling in mind, but it is the next logical step if I want to know for sure.
Sigh.