Friday, May 01, 2020
It seems that C's bit-fields are more of a pessimization than an optimization
A few days ago, maybe a few weeks ago, I don't know, the days are all just merging together into one long undifferentiated timey wimey blob but I'm digress, I had the odd thought that maybe, perhaps, I could make my Motorola 6809 emulator faster by using a bit-field for the condition codes instead of the individual booleans I'm using now. The thought was to get rid of the somewhat expensive routines to convert the flags to a byte value and back. I haven't used bit-fields all that much in 30 years of C programming as they tend to be implementation dependent:
- Whether a “plain”
int
bit-field is treated as asigned int
bit-field or as anunsigned int
bit-field (6.7.2, 6.7.2.1).- Allowable bit-field types other than
_Bool
,signed int
, andunsigned int
(6.7.2.1).- Whether a bit-field can straddle a storage-unit boundary (6.7.2.1).
- The order of allocation of bit-fields within a unit (6.7.2.1).
- The alignment of non-bit-field members of structures (6.7.2.1). This should present no problem unless binary data written by one implementation is read by another.
- The integer type compatible with each enumerated type (6.7.2.2).
C99 standard, annex J.3.9
But I could at least see how gcc
deals with them and see if there is indeed a performance increase.
I converted the definition of the condition codes from:
struct { bool e; bool f; bool h; bool i; bool n; bool z; bool v; bool c; } cc;
to
union { /*--------------------------------------------------- ; I determined this ordering of the bits empirically. ;----------------------------------------------------*/ struct { bool c : 1; bool v : 1; bool z : 1; bool n : 1; bool i : 1; bool h : 1; bool f : 1; bool e : 1; } f; mc6809byte__t b; }
(Yes,
by using a union I'm inviting “unspecified behavior”—from the C99 standard: “[t]he value of a union member other than the last one stored into (6.2.6.1)”),
but at least gcc
does the sane thing in this case.)
The code thus modified,
I ran some tests to see the speed up and the results were rather disappointing—it was slower using bit-fields than with 8 separate boolean values.
My guess is that the code used to set and check bits, especially in an expression like (cpu->cc.f.n == cpu->cc.f.v) && !cpu->cc.f.z
was larger
(and thus slower)
than just using plain bool
for each field.
So the upshot—by changing the code to use an implementation-defined detail and invoking unspecified behavior, thus making the resulting program less portable, I was able to slow the program down enough to see it wasn't worth the effort.
Perfect.
Monday, May 04, 2020
On the plus side, it happens the same way each time; on the down side, it doesn't happen often
I'm still trying to find the cause of the crash.
I made a small change
(a switch from using lua_touserdata()
,
which can return NULL
,
to using luaL_checkudata()
,
which if it returns,
will never return NULL
; else it throws an error)
which didn't help resolve the issue—the program still dumped core.
Both last week and today,
it crashed in the same place in Ltls_write()
:
static int Ltls_write(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); size_t len; char const *data = luaL_checklstring(L,2,&len); lua_pushinteger(L,tls_write(*tls,data,len)); return 1; }
The variable tls
was itself not NULL
,
but the value it pointed to was! There's only one place in the code where it's set to NULL
—in the __gc
metamethod
(the bits with the Lua registry is to deal with callbacks from libtls
which knows nothing about Lua):
static int Ltls___gc(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); if (*tls != NULL) { lua_pushlightuserdata(L,*tls); lua_pushnil(L); lua_settable(L,LUA_REGISTRYINDEX); tls_free(*tls); *tls = NULL; } return 0; }
This is only called when the object no longer has any references to it. I'm not aware of any place where this value gets overwritten (but I'm not discounting it outright quite yet). The crashes started after I fixed the leak, which turned out to be this function (this is the pre-fix version):
static int Ltls_close(lua_State *L) { lua_pushinteger(L,tls_close(*(struct tls **)lua_touserdata(L,1))); return 1; }
The leak happened because I left the reference to the object in the Lua registry, so the garbage collector would never reclaim the resource. The fix itself was easy:
static int Ltls_close(lua_State *L) { int rc = tls_close(*(struct tls **)luaL_checkudata(L,1,TYPE_TLS)); if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT)) Ltls___gc(L); lua_pushinteger(L,rc); return 1; }
Yes, it's slightly complicated by the fact that closing a TLS connection might require a few more packets of data between the two endpoints to close out the protocol, but that's taken into account by the framework I'm using. At least, I hope it is.
So my theory right now is that the __gc
metamethod is being called before the connection is fully closed,
but to confirm that,
I somehow need to capture that scenario.
It's consistent in that it seems to be the same client making the same requests causing the same crash.
On the down side,
I have no idea who runs the client,
or where it is
(other than an IP address).
And if I make the same requests with my client,
it doesn't crash,
so I can't reproduce it.
And there's no telling when it will happen again.
I know it's just a small detail I'm missing on my side that's causing the issue,
but I just can't seem to locate the detail.
To that end,
I modified Ltls___gc()
and Ltls_write()
to attach some traceback information to see if my theory that Ltls___gc()
is being called too early.
Lua makes this pretty easy to do.
I modified Ltls___gc()
:
static int Ltls___gc(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); if (*tls != NULL) { luaL_traceback(L,L,"__gc",0); lua_setuservalue(L,1); lua_pushlightuserdata(L,*tls); lua_pushnil(L); lua_settable(L,LUA_REGISTRYINDEX); tls_free(*tls); *tls = NULL; } return 0; }
I get a traceback message and assign it to the object about to be cleaned.
In Lua,
a userdata goes through the garbage collection twice—once to clean the resources that Lua isn't aware of
(that's what this function does)
and the second time
(in the next garbage collection cycle)
to reclaim the memory Lua uses to store the userdata.
Then in Ltls_write()
I have:
static int Ltls_write(lua_State *L) { struct tls **tls = luaL_checkudata(L,1,TYPE_TLS); size_t len; char const *data = luaL_checklstring(L,2,&len); if (*tls == NULL) { lua_getuservalue(L,1); if (!lua_isnil(L,-1)) syslog(LOG_NOTICE,"%s",lua_tostring(L,-1)); luaL_traceback(L,L,"write",0); syslog(LOG_NOTICE,"%s",lua_tostring(L,-1)); lua_pop(L,2); lua_pushinteger(L,-1); } else lua_pushinteger(L,tls_write(*tls,data,len)); return 1; }
This will (hopefully,
knock on formica)
give me two tracebacks—one when the garbage collection is called
(or Ltls_close()
is called) and then a write is attempted,
and that will point me to the detail I'm missing.
The other theory is that I'm overwriting memory somewhere and that's a harder issue to track down.
Wednesday, May 06, 2020
The root cause of a crash
It happend yet again and this time, I was able to get the stack traces! Woot! But as is the case with most bugs, the fix was laughably trivial once found. The trouble is always in finding it, and all one can hope for is to learn for next time.
So the normal pattern of requests came in and the program crashes (only this time with the stack traces I was hoping to capture):
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/" bytes=2505 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0001" bytes=283 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0002" bytes=249 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0003" bytes=129 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0004" bytes=205 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0005" bytes=112 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0006" bytes=248 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0007" bytes=89 subject="" issuer="" remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0008" bytes=360 subject="" issuer="" remote=XXXXXXXXXXXX status=59 request="" bytes=16 subject="" issuer="" __gc stack traceback: [C]: in method 'close' /usr/local/share/lua/5.3/org/conman/nfl/tls.lua:126: in method 'close' GLV-1.12556.lua:288: in function <GLV-1.12556.lua:232> write stack traceback: [C]: in method 'write' /usr/local/share/lua/5.3/org/conman/nfl/tls.lua:92: in function </usr/local/share/lua/5.3/org/conman/nfl/tls.lua:91> (...tail calls...) GLV-1.12556.lua:206: in upvalue 'reply' GLV-1.12556.lua:399: in function <GLV-1.12556.lua:232>
These series of requests exist to test Gemini client programs and since I patched the memory leak, the program has always crashed after test #8 (first clue). This test sees if the client can properly handle relative links:
Title: Link-full path, parent directory, no text Status: 20 Content-Type: text/gemini At this point, if you are writing a client that manipulates URLs directly, you will need to look at RFC-3896, section 5.2.4 to handle these relative paths correctly. Things are about to get really messy. If you want to skip ahead to the MIME portion, select the second link instead of the first. => /test/../test/torture/0009 => 0011
Heck, I used two different clients (one I wrote, and the Gemini Gateway), but I was never able to reproduce the error (second clue) so I never did bother to look deeper at this. It was only when I had the stack traces was I able to track the issue down to this bit of code (there's a bit more to see the context but the actual bug is in the last block of code below):
if not loc.host then log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n")) ios:close() return end if loc.scheme ~= 'gemini' or loc.host ~= CONF.network.host or loc.port ~= CONF.network.port then log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n")) ios:close() return end -- --------------------------------------------------------------- -- Relative path resolution is the domain of the client, not the -- server. So reject any requests with relative path elements. -- --------------------------------------------------------------- if loc.path:match "/%.%./" or loc.path:match "/%./" then log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n")) ios:close() end
These are error paths, and for each one, I close the connection and return from the handler, except for the case of a relative path!
Head, meet desk.
The clues I had were pointing right at the problem,
but I just couldn't see it.
The reason I couldn't reproduce the issue is that the two clients I used were implemented properly,
so of course the check for a relative path wouldn't trigger.
And the reason why I never found this before was because the memory leak prevented the __gc()
method from running,
which meant the TLS context pointer would never be overwritten with NULL
,
thus,
no crash.
Fixing that bug revealed this bug—one that has existed since I originally wrote the code.
A test by hand
(by running ./openssl s_client -connect gemini.conman.org:19651
and typing in a request with a relative path)
quickly revealed I had found the root cause of the crash.
Now,
it might occur to some to ask why didn't I write the Ltls_close()
function like this:
static int Ltls_close(lua_State *L) { struct tls **luaL_checkudata(L,1,TYPE_TLS); if (*tls == NULL) { lua_pushinteger(TLS_ERROR); return 1; } int rc = tls_close(*(struct tls **)luaL_checkudata(L,1,TYPE_TLS)); if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT)) Ltls___gc(L); lua_pushinteger(L,rc); return 1; }
That is—to have the function check for a NULL
pointer?
Or all the routines in the TLS module to check for a NULL
pointer?
That's a conscience decision on my part because defensive code hides real bugs
(and in this case,
it most definitely would have hidden the bug).
I came to this conclusion having read Writing Solid Code years ago
(and it was one of two books that radically changed how I approach programming—the other one being
Thinking Forth; it's well worth reading even if you don't program in Forth).
The book presents a solid case why defensive code can hide bugs much better than I can.
And even if it took a week to track down the bug,
it was worth it in the end.
Saturday, May 09, 2020
It's out!
About 50 months ago [It's only been 51 days. —Editor] [What? Really? Only 51 days? It feels longer. —Sean] [Yes, only 51 days. I know the COVID-19 thing has warped your sense of time but get a grip on yourself. —Editor] [Sigh. —Sean] 51 days ago, I reported a bug with Firefox, which took about a week to fix but wasn't slated to be released until version 76 of Firefox. Well, I just downloaded version 76 and yes, the bug is truly fixed.
Huzzah!
Now if only the Oligarchist Cell Phone Company we work with at The Corporation could work as fast.
Thursday, May 14, 2020
The shaving of yaks
A day ago,
a month ago
(time no longer has real meaning anymore),
I was ask to look into moving our code away from SVN and into
git
.
Ever since I've been pretty much busy with the shaving of yaks—lots and lots of yaks.
And it's more than just converting a repository of code from SVN to git
—it's also breaking up what is basically
a monorepos into lots of separate repos for reasons
(operations hates having to checkout the 800 pound gorilla for the 4 oz. banana).
And without losing history if at all possible.
Lots of yaks to shave in this project.
So I've been learning about submodules in git
and while I like git
,
I'm not a fan of the submodule.
First,
when you clone a repository with submodules
(git clone https://git.example.com/foo.git
)
you don't get the submodules.
That's another two steps to get them
(git submodule init; git submodule update
).
I solved that issue with a new Makefile
target:
getmod: git submodule init git submodule update
And on the build system, I made sure that make getmod
was done prior to make
.
That issue solved.
Another issue—we use Lua in our department, but we ended up with using two different versions. The stuff I built is still using 5.1.5, while a new project a fellow cow-orker wrote used 5.3.2. This is a chance to consolidate some dependencies, and to that end, I have a new single Lua repo with every version of Lua from 5.1.4 to 5.3.5 (with all the patches applied). But as I found out (and many yaks were shaved to bring me this information) you can't just checkout a particular tag or branch for a submodule.
Grrrrrr.
So I solved that issue in the Makefile
as well:
VLUA := $(shell cd deps/lua ; git describe --tag) ifneq ($(VLUA),"5.1.5.p2") DUMMY := $(shell cd deps/lua; git checkout 5.1.5.p2) endif
This checks to see which version of Lua is checked out,
and if it's not the one I want,
check out the version I do want.
I also had to make sure the .gitmodule
file had ignore = all
set:
[submodule "deps/lua"] path = deps/lua url = https://git.example.com/lua.git ignore = all
to prevent git status
from freaking people out.
I figure this will keep me busy for the next few months at least. Or maybe days. Time just flows strangely these days.
Saturday, May 16, 2020
A busy day
It's a busy day today.
First off,
the power shut off long enough for the computers to be shut down but short enough that maybe the UPSs would not have drained completely.
Sigh.
And 10 minutes after the power is restored,
I've already blocked half a dozen IP addresses trying to log in via ssh
.
I also received three bug reports for my Motorola 6809 emulator. This is the first time in eight years I've received a bug report for that particular project. It's nice to know that someone found the project useful.
I"ve also become aware of a potential email issue with my server that probably started when I last had issues with email. Briefly, when my server (a virtual server which I think is part of the cause) makes an outgoing connection, the other end sees the IP address of the physical host my server is on, not the IP address assigned to my server. This has some serious implications with email and I need to get it resolved. Sigh.
I'm also planning tomorrow's D&D session. We're also going to try using Zoom for hosting the session as Roll20 is not quite stable nor fast to use. Roll20 also seems to suck up all the bandwidth here at Chez Boca which I find annoying.
And I finally checked—it's only been 61 days since the quarantine has started. Just two months. Man, it feels like it's been forever.
Update on Wednesday, June 10TH, 2020
I solved the issue with email. Finally!