The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, March 31, 2015

More musings on serializing Lua with CBOR

So, where were we?

Ah, that's right … functions!

At first sight, it looks like it would be rather trivial. After all, there's a standard Lua function, string.dump(), which returns a binary representation of the given function that can be reloaded using load(). And CBOR supports encoding arbitrary binary data.
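As a minimal sketch of that round trip (assuming Lua 5.2 or later, where load() accepts a string; the function add() and the chunk name "add" are made up for illustration):

```lua
-- Round-trip a Lua function with no upvalues through string.dump()/load().
local function add(a, b)
  return a + b
end

local blob  = string.dump(add)                 -- binary chunk, as a Lua string
local clone = assert(load(blob, "add", "b"))   -- "b" = accept binary chunks only

print(#blob > 0, clone(2, 3))
```

The blob is just a Lua string, so it can be handed to a CBOR encoder as a byte string.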

But there are issues with this. First, this only works for functions written in Lua; functions written in C cannot be dumped, and an attempt to do so results in an error. For now, we'll ignore this restriction. Second, the binary representation returned by string.dump() is not portable.
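The first restriction is easy to see for yourself. This is a small sketch, assuming a stock Lua where print() is implemented in C; lua_function() is a made-up name:

```lua
-- string.dump() works on Lua functions but raises an error for C functions,
-- so wrap the calls in pcall() to observe the failure without aborting.
local function lua_function()
  return 42
end

local ok_lua    = pcall(string.dump, lua_function)
local ok_c, err = pcall(string.dump, print)

print(ok_lua, ok_c, err)
```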

Up till now, everything we've been encoding with CBOR has been portable with respect to computer architectures. But serialize a Lua function, and the receiving end would have to be the same architecture running the same version of Lua or else Bad Things™ will happen.

Another approach is to send the source code. You can use debug.getinfo() to obtain it, but there are two issues with this: first, it doesn't always work (there are cases where Lua can't determine the source, say, if the function was loaded using the binary representation to begin with), and second, the source code won't include upvalues.
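Here is roughly what debug.getinfo() hands back, as a sketch assuming Lua 5.2 or later and a chunk loaded from a string (chunk_text and double() are invented for the example). For a chunk loaded from a file, the source field would instead be "@" followed by the file name, and you would have to read the lines from the file yourself:

```lua
local chunk_text = [[
local function double(x)
  return x * 2
end
return double
]]

-- Load the chunk from a string and run it to obtain the function.
local double = assert(load(chunk_text))()
local info   = debug.getinfo(double, "S")

-- With no chunk name given, the source field is the chunk text itself,
-- and linedefined/lastlinedefined bracket the function inside that text.
print(info.source == chunk_text)
print(info.linedefined, info.lastlinedefined)
```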

And now—a digression about variable scoping in Lua to explain what “upvalues” are.

Take this contrived example:

global_x = 5
local local_x = 3

function foo(param_x)
  local local_foo_y = 4

  local function bar(param_a)
    return param_x * param_a + global_x * local_x + local_foo_y
  end
  return bar
end

Function bar() references four variables defined outside of bar() itself, global_x (and I'll get to global variables in a bit), local_x, param_x, and local_foo_y. And because functions in Lua can be passed around and returned like any other data type, the values of those variables outside of bar() need to be somehow associated with bar(), and that's what a closure (which is what Lua uses to represent a function) does—it collects variables outside the scope of a function (it “closes over” them) and stores them so bar() can still use them, even if the scope the variable was defined in (like param_x or local_foo_y) no longer exists. Such variables are called “upvalues” in Lua.

So in this example, local_x, param_x and local_foo_y are all upvalues of bar(). global_x is not an upvalue because it's a global variable, and globals are handled differently. In Lua, a global variable is stored in a Lua table, and a reference to that table is stored in an upvalue automatically generated if needed (and can always be referenced by the name _ENV):

bar = foo(3)
info = debug.getinfo(bar,"u") -- get number of upvalues
for i = 1 , info.nups do
  name,value = debug.getupvalue(bar,i)
  print(name,value)
end

Upvalues for the function bar()

name	value
param_x	3
_ENV	table: 0x89433c0
local_x	3
local_foo_y	4

You can see that Lua added the _ENV upvalue automatically. And it's this table where you'll find global_x.
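The “if needed” part can be checked with a short sketch (assuming Lua 5.2 or later; upvalue_names(), no_globals() and uses_global() are hypothetical names): a function that never touches a global gets no _ENV upvalue at all.

```lua
-- Collect the names of all upvalues of a function.
local function upvalue_names(f)
  local names = {}
  for i = 1, debug.getinfo(f, "u").nups do
    names[i] = (debug.getupvalue(f, i))   -- keep only the name
  end
  return names
end

local function no_globals(a)  return a + 1       end  -- references no globals
local function uses_global(a) return tostring(a) end  -- references tostring

print(#upvalue_names(no_globals))
print(table.concat(upvalue_names(uses_global), ", "))
```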

Now, back to our regularly scheduled discussion about serializing Lua functions.

So, if we attempt to serialize the source, all we would get for bar() would be:

local function bar(param_a)
  return param_x * param_a + global_x * local_x + local_foo_y
end

The “serialized” source would need to be modified to be:

local param_x
local _ENV = _ENV
local local_x
local local_foo_y

local function bar(param_a)
  return param_x * param_a + global_x * local_x + local_foo_y
end

return bar

That should work (and we still have to serialize the upvalues), but it's untested, as I only serialized the binary representation (which I have to support anyway). Still, it's something I should keep in mind.
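For what it's worth, here is a rough sketch of how the receiving end could handle the source route, assuming Lua 5.2 or later. The saved table stands in for the upvalue values that would arrive alongside the source; it and the chunk name "bar" are inventions for the example, not part of any existing encoder:

```lua
local source = [[
local param_x
local _ENV = _ENV
local local_x
local local_foo_y

local function bar(param_a)
  return param_x * param_a + global_x * local_x + local_foo_y
end

return bar
]]

global_x = 5   -- the global has to exist on the receiving end

-- Hypothetical: upvalue values that were serialized along with the source.
local saved = { param_x = 3, local_x = 3, local_foo_y = 4 }

-- Compile the source as text, sharing our environment, and fetch bar().
local bar = assert(load(source, "bar", "t", _ENV))()

-- Fill in each upvalue by name; _ENV is already set by the source itself.
local i = 1
while true do
  local name = debug.getupvalue(bar, i)
  if name == nil then break end
  if saved[name] ~= nil then
    debug.setupvalue(bar, i, saved[name])
  end
  i = i + 1
end

print(bar(2))
```

Matching upvalues by name rather than by position means the order of the local declarations in the generated source doesn't matter.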

Sorry, I digress.

Serializing the _ENV table is concerning. A stock Lua global environment contains about 150 items, mostly functions but a few values like _VERSION (which contains a string denoting the Lua version). Even worse, those functions are all written in C, which can't be serialized.

And even if the functions could be serialized, do you really want to serialize 150 additional items? But we can't just skip serializing _ENV, as it could be modified to provide a custom “global environment” for the function being serialized, although that custom _ENV could still reference existing functions written in C!

What's really needed is a way to reference the data we need.

Well, CBOR does have that semantic tagging we've already used. So why not a bit more semantic tagging?

The following table:

{
  print    = print,
  tostring = tostring,
  tonumber = tonumber,
  io       = io
}

will basically get encoded as:

CBOR_TAG shareable CBOR_MAP 4 -- number of items in the map
	CBOR_TEXT "print"    CBOR_TAG __Lua CBOR_TEXT "print"
	CBOR_TEXT "tostring" CBOR_TAG __Lua CBOR_TEXT "tostring"
	CBOR_TEXT "tonumber" CBOR_TAG __Lua CBOR_TEXT "tonumber"
	CBOR_TEXT "io"       CBOR_TAG __Lua CBOR_TEXT "io"

Upon decoding, a string tagged as __Lua will be translated to the appropriate Lua value of the given name (which means searching through the global variables for a variable with the given name). This solves sending the standard global environment over and it kind of, somewhat, sort of, solves serializing C functions—as long as the C function exists when the function is deserialized.
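The name lookup in both directions can be sketched in a few lines. This is only an illustration under assumptions: build_name_index() and resolve() are hypothetical helpers, not functions from any CBOR library, and only top-level globals are considered:

```lua
-- Encoder side: map each global function or table to its name, so a value
-- such as print can be sent as the tagged text "print" instead of by value.
local function build_name_index(env)
  local index = {}
  for name, value in pairs(env) do
    local t = type(value)
    if type(name) == "string" and (t == "function" or t == "table") then
      index[value] = name
    end
  end
  return index
end

-- Decoder side: turn a tagged name back into whatever the receiving
-- environment has under that name, failing loudly if it is missing.
local function resolve(name, env)
  local value = env[name]
  if value == nil then
    error("no global named '" .. name .. "' on this end")
  end
  return value
end

local names = build_name_index(_G)
print(names[print], names[io], resolve("tostring", _G) == tostring)
```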

Well, it's a solution, anyway.

To recap, not only do we need to serialize the Lua function, but its upvalues as well. I decided to use a CBOR array for this. The first item is the function itself, with the rest of the items in the array being the various upvalues of the function. And thus, our example function bar() is encoded:

CBOR_ARRAY 5 -- number of items in the array
	CBOR_BIN  1B4C7561530019930D0A...
	CBOR_UINT 3
	CBOR_TAG __Lua CBOR_TEXT "_G"
	CBOR_UINT 3
	CBOR_UINT 4
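
The same layout can be mocked up with a plain Lua table standing in for the CBOR array. This is a sketch, assuming Lua 5.2 or later; pack_function(), unpack_function() and the LUA_G marker are hypothetical names, with the marker playing the role of the __Lua-tagged "_G":

```lua
local LUA_G = {}   -- stand-in for CBOR_TAG __Lua CBOR_TEXT "_G"

-- Item 1 is the dumped function; items 2..n are its upvalues, in order.
local function pack_function(f)
  local packed = { string.dump(f) }
  local i = 1
  while true do
    local name, value = debug.getupvalue(f, i)
    if name == nil then break end
    if value == _G then value = LUA_G end
    packed[i + 1] = value
    i = i + 1
  end
  packed.n = i   -- explicit count, since an upvalue may be nil
  return packed
end

local function unpack_function(packed)
  local f = assert(load(packed[1], "unpacked", "b"))
  -- load() only sets the first upvalue (to the global environment), so
  -- every upvalue is overwritten explicitly with its saved value.
  for i = 2, packed.n do
    local value = packed[i]
    if value == LUA_G then value = _G end
    debug.setupvalue(f, i - 1, value)
  end
  return f
end

global_x = 5
local local_x = 3

local function foo(param_x)
  local local_foo_y = 4
  return function(param_a)
    return param_x * param_a + global_x * local_x + local_foo_y
  end
end

local packed = pack_function(foo(3))
local copy   = unpack_function(packed)
print(packed.n, copy(2))
```

Note that this copies upvalue values. Two closures that shared an upvalue before packing will not share it afterwards.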

There is still one slight problem, though: this assumes that global_x exists as a global variable when deserializing the function. Unfortunately, there isn't an easy answer to this, except “better make sure it exists.”

So that just leaves the Lua types userdata and thread to encode …

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.