Tuesday, March 31, 2015
More musings on serializing Lua with CBOR
So, where were we?
Ah, that's right … functions!
At first sight,
it looks like it would be rather trivial.
After all,
there's a standard Lua function, string.dump(), which returns a binary representation of the given function that can be reloaded using load().
And CBOR supports encoding arbitrary binary data.
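A minimal sketch of that round trip, assuming Lua 5.2 or later (where load() accepts a binary chunk passed as a string). The function add is purely illustrative:

```lua
-- A throwaway function with no upvalues, so nothing is lost in the dump.
local function add(a, b)
  return a + b
end

-- string.dump() returns the precompiled chunk as an ordinary Lua string,
-- which is the kind of data a CBOR byte string can carry.
local blob = string.dump(add)

-- The "b" mode tells load() to accept only binary chunks.
local copy = assert(load(blob, "add", "b"))

print(type(blob), copy(2, 3))
```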
But there are issues with this.
First,
this only works for functions written in Lua;
functions written in C cannot be dumped,
and an attempt to do so results in an error.
For now,
we'll ignore this restriction.
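The restriction is easy to observe. A small sketch, using pcall() to catch the error instead of letting it abort the script:

```lua
-- pcall() catches the error that string.dump() raises for a C function.
local ok, err = pcall(string.dump, print)
print(ok, err)

-- A function written in Lua dumps without complaint.
local ok_lua = pcall(string.dump, function() end)
print(ok_lua)
```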
Second,
the binary representation returned by string.dump() is not portable.
Up till now, everything we've been encoding with CBOR has been portable with respect to computer architectures. But serialize a Lua function, and the receiving end would have to be the same architecture running the same version of Lua or else Bad Things™ will happen.
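One way to see the version dependence is to look at the start of a dumped chunk. A small sketch, assuming the reference (PUC-Rio) Lua implementation, whose binary chunks begin with the signature ESC "Lua" followed by a version byte:

```lua
local blob = string.dump(function() end)

-- The first four bytes are the signature "\27Lua" (hex 1B 4C 75 61).
print(blob:sub(1, 4) == "\27Lua")

-- The fifth byte encodes the Lua version, e.g. 0x53 for Lua 5.3.
print(string.format("0x%02X", blob:byte(5)), _VERSION)
```

The header goes on to record build details such as the sizes of the number types, which is why a chunk is tied to both the Lua version and the architecture that produced it.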
Another approach is to send the source code.
You can use debug.getinfo() to locate it (it reports the chunk the function came from and the lines it spans),
but there are two issues with this:
first, it doesn't always work (there are cases where Lua can't determine the source, say, if the function was loaded from its binary representation to begin with),
and second, the source code won't include upvalues.
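A sketch of what debug.getinfo() reports with the "S" option, assuming Lua 5.2 or later. The function double is only an illustration:

```lua
-- A function compiled from a string: the chunk name defaults to the string itself.
local text   = "return function(a) return a * 2 end"
local double = load(text)()
local info   = debug.getinfo(double, "S")
print(info.what, info.source, info.linedefined, info.lastlinedefined)

-- A C function has no Lua source at all.
local cinfo = debug.getinfo(print, "S")
print(cinfo.what, cinfo.source)
```

For a function loaded from a file, the source field is the file name prefixed with "@", so the text itself still has to be read from that file using the reported line numbers.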
And now—a digression about variable scoping in Lua to explain what “upvalues” are.
Take this contrived example:
    global_x = 5
    local local_x = 3

    function foo(param_x)
      local local_foo_y = 4

      local function bar(param_a)
        return param_x * param_a
             + global_x * local_x
             + local_foo_y
      end

      return bar
    end
Function bar() references four variables defined outside of bar() itself: global_x (and I'll get to global variables in a bit), local_x, param_x, and local_foo_y.
And because functions in Lua can be passed around and returned like any other data type,
the values of those variables outside of bar() need to be somehow associated with bar(),
and that's what a closure (which is what Lua uses to represent a function) does: it collects variables outside the scope of a function (it “closes over” them)
and stores them so bar() can still use them,
even if the scope the variable was defined in (like param_x or local_foo_y) no longer exists.
Such variables are called “upvalues” in Lua.
So in this example, local_x, param_x and local_foo_y are all upvalues of bar().
global_x is not an upvalue, because it's a global variable, and those are handled differently.
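To make the difference concrete, here is a sketch that reuses the example above. Each call to foo() produces a closure with its own param_x, while global_x is looked up fresh on every call:

```lua
global_x = 5
local local_x = 3

function foo(param_x)
  local local_foo_y = 4
  local function bar(param_a)
    return param_x * param_a + global_x * local_x + local_foo_y
  end
  return bar
end

-- Two closures over the same code, each keeping its own param_x
-- long after the call to foo() has returned.
local bar3  = foo(3)
local bar10 = foo(10)

local before3, before10 = bar3(2), bar10(2)
print(before3, before10)  -- 25  39

-- Changing the global is visible to both closures.
global_x = 6
local after3 = bar3(2)
print(after3)             -- 28
```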
In Lua (5.2 and later),
a global variable is stored in a Lua table,
and a reference to that table is stored in an upvalue that is generated automatically if needed
(and can always be referenced by the name _ENV):
    bar  = foo(3)
    info = debug.getinfo(bar,"u") -- get number of upvalues

    for i = 1 , info.nups do
      name,value = debug.getupvalue(bar,i)
      print(name,value)
    end
name | value |
---|---|
param_x | 3 |
_ENV | table: 0x89433c0 |
local_x | 3 |
local_foo_y | 4 |
You can see that Lua added the _ENV upvalue automatically.
And it's this table where you'll find global_x.
Now, back to our regularly scheduled discussion about serializing Lua functions.
So, if we attempt to serialize the source,
all we would get for bar() would be:

    local function bar(param_a)
      return param_x * param_a
           + global_x * local_x
           + local_foo_y
    end
The “serialized” source would need to be modified to be:
    local param_x
    local _ENV = _ENV
    local local_x
    local local_foo_y

    local function bar(param_a)
      return param_x * param_a
           + global_x * local_x
           + local_foo_y
    end

    return bar
That should work (and we would still have to serialize the upvalues), but it's untested, as I only serialized the binary representation (which I have to support anyway). Still, it's something I should keep in mind.
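One way to try the modified source is sketched below, assuming Lua 5.2 or later. The upvalues table is a hypothetical stand-in for values that would arrive alongside the source, and the values are filled in by name using debug.setupvalue():

```lua
-- The modified source from above, with a slot declared for each upvalue.
local source = [[
local param_x
local _ENV = _ENV
local local_x
local local_foo_y
local function bar(param_a)
  return param_x * param_a + global_x * local_x + local_foo_y
end
return bar
]]

-- Hypothetical stand-in for the upvalues sent along with the source.
local upvalues = { param_x = 3, _ENV = _G, local_x = 3, local_foo_y = 4 }

global_x = 5 -- must already exist on the receiving side

-- Compile the text ("t" = text chunks only) and run it to get bar().
local bar = assert(load(source, "=bar", "t"))()

-- Fill in each upvalue by name rather than relying on its position.
local i = 1
while true do
  local name = debug.getupvalue(bar, i)
  if not name then break end
  debug.setupvalue(bar, i, upvalues[name])
  i = i + 1
end

print(bar(2)) -- 3 * 2 + 5 * 3 + 4
```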
Sorry, I digress.
Serializing the _ENV table is concerning.
A stock Lua global environment contains about 150 items,
mostly functions but a few values like _VERSION (which contains a string denoting the Lua version).
Even worse,
those functions are all written in C,
which can't be serialized.
And even if the functions could be serialized,
do you really want to serialize 150 additional items?
But we can't just skip serializing _ENV,
as it could be modified to provide a custom “global environment” for the function being serialized,
although that custom _ENV could still reference existing functions written in C!
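A sketch of such a custom environment, assuming Lua 5.2 or later, where the fourth argument to load() supplies the environment. The sandbox table and its contents are made up for illustration:

```lua
-- A custom "global environment" that still borrows a C function.
local sandbox = { x = 42, tostring = tostring }

local f = assert(load("return tostring(x)", "=sandboxed", "t", sandbox))
print(f())

-- The environment is simply the function's first upvalue.
local name, value = debug.getupvalue(f, 1)
print(name, value == sandbox)
```

Here the function sees only x and tostring as globals, yet tostring is the very same C function found in the standard environment.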
What's really needed is a way to reference the data we need.
Well, CBOR does have that semantic tagging we've already used. So why not a bit more semantic tagging?
The following table:
{ print = print, tostring = tostring, tonumber = tonumber, io = io }
will basically get encoded as:
    CBOR_TAG shareable
      CBOR_MAP 4 -- number of items in the map
        CBOR_TEXT "print"
        CBOR_TAG __Lua CBOR_TEXT "print"
        CBOR_TEXT "tostring"
        CBOR_TAG __Lua CBOR_TEXT "tostring"
        CBOR_TEXT "tonumber"
        CBOR_TAG __Lua CBOR_TEXT "tonumber"
        CBOR_TEXT "io"
        CBOR_TAG __Lua CBOR_TEXT "io"
Upon decoding, a string tagged as __Lua will be translated to the appropriate Lua value of the given name
(which means searching through the global variables for a variable with the given name).
This solves sending the standard global environment over and it kind of,
somewhat,
sort of,
solves serializing C functions—as long as the C function exists when the function is deserialized.
Well, it's a solution, anyway.
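The lookup behind the __Lua tag can be sketched in plain Lua. The helper names global_name_of() and resolve_global() are hypothetical and not part of any CBOR library, and this version only considers top-level globals:

```lua
-- Hypothetical helpers; these are not the API of any CBOR library.

-- Encoding side: find the global name that currently refers to `value`.
local function global_name_of(value)
  for name, candidate in pairs(_G) do
    if rawequal(candidate, value) then
      return name
    end
  end
  return nil -- not reachable as a top-level global
end

-- Decoding side: turn a __Lua-tagged name back into the live value.
local function resolve_global(name)
  return _G[name]
end

local name = global_name_of(print)
print(name, resolve_global(name) == print)
print(global_name_of(_G))
```

A value that lives inside another table, such as a function in the io table, would need a recursive search or a dotted name, which this sketch leaves out.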
To recap,
not only do we need to serialize the Lua function,
but its upvalues as well.
I decided to use a CBOR array for this.
The first item is the function itself,
with the rest of the items in the array being the various upvalues of the function.
And thus, our example function bar() is encoded:

    CBOR_ARRAY 5 -- number of items in the array
      CBOR_BIN 1B4C7561530019930D0A...
      CBOR_UINT 3                    -- param_x
      CBOR_TAG __Lua CBOR_TEXT "_G"  -- _ENV
      CBOR_UINT 3                    -- local_x
      CBOR_UINT 4                    -- local_foo_y
There is still one slight problem, though: this assumes that global_x exists as a global variable when deserializing the function.
Unfortunately, there isn't an easy answer to this,
except “better make sure it exists.”
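The array layout can be sketched without any CBOR at all, using a plain Lua table in place of the CBOR array. This assumes Lua 5.2 or later. pack_function() and unpack_function() are hypothetical names, and the _ENV upvalue is passed along directly here instead of going through a __Lua tag:

```lua
-- Hypothetical helpers: a plain Lua table stands in for the CBOR array.

local function pack_function(f)
  local nups   = debug.getinfo(f, "u").nups
  local packed = { string.dump(f) }      -- item 1: the function itself
  for i = 1, nups do
    local _, value = debug.getupvalue(f, i)
    packed[i + 1] = value                -- items 2..n: the upvalues, in order
  end
  packed.n = nups + 1                    -- explicit count, since upvalues may be nil
  return packed
end

local function unpack_function(packed)
  local f = assert(load(packed[1], "=(unpacked)", "b"))
  for i = 2, packed.n do
    debug.setupvalue(f, i - 1, packed[i])
  end
  return f
end

-- The example from earlier in the post.
global_x = 5
local local_x = 3

function foo(param_x)
  local local_foo_y = 4
  local function bar(param_a)
    return param_x * param_a + global_x * local_x + local_foo_y
  end
  return bar
end

local bar  = foo(3)
local copy = unpack_function(pack_function(bar))
print(bar(2), copy(2))
```

Note that the copy gets its own upvalue slots: the values are copied at packing time, so a later change to local_x would be seen by bar() but not by the copy.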
So that just leaves the Lua types userdata and thread to encode …