Friday, March 27, 2015
Some musings on serializing Lua with CBOR
Because it seemed like a fun problem, I decided to try my hand at writing a serialization library for Lua. It's not like there are many to choose from or anything. And besides, how long could it take?
[Many, many hours later—too many to mention.]
Well, that was fun.
I decided to use Concise Binary Object Representation as the binary format. The specification is straightforward, and surprising for a binary format, you can include quite a bit of semantic information.
Now, Lua itself has eight basic types; a few have subtypes:
- nil
- boolean
- number
- integer
- real
- string
- table
- function
- userdata
- light userdata
- full userdata
- thread
The nil type represents nothing—i.e. the absense of a value.
CBOR has an encoding for this,
so this is easy to handle.
The boolean type has just two values—true
and false
,
and again,
CBOR has an encoding for these two values as well,
so no problem there.
Lua's number type has two subtypes integer and real
(for Lua 5.3 and above—previous versions of Lua only had one numeric type,
real as that could handle integer values up to 9,007,199,254,740,989,
way larger than the 32-bit value of 4,294,967,296,
but the rise of 64-bit systems required Lua to adapt to handle integer values up to 9,223,372,036,854,775,807).
Again, CBOR has encodings for both integers and reals.
And support for the string type is also easy.
Even better, CBOR has two formats for strings,
one for text,
and another format for arbitrary binary data.
And then there's the table.
In Lua, tables are actually hash tables. There's an internal optimization if a table is treated as a real array, where the indices are consecutive integers starting from 1. Then the items are actually stored as an array. Even worse (or better, depending on your viewpoint), a Lua table can be used as both at the same time.
For example:
my_table = { 'alpha', 'beta', 'gamma', 'delta', 'epsilon', alpha = 1, beta = 2, gamma = 3, delta = 4, epsilon = 5, pi = 3.1415926 }
The strings “alpha”, “beta”, “gamma”, “delta”, and “epsilon” are stored in the array portion,
and if you take the length of my_table
,
you'll get back 5. So while you can reference my_table[3]
(which is “gamma”)
you can also reference my_table['pi']
(which is 3.1415926).
Now CBOR has encodings for arrays and hash tables,
an ARRAY
and a MAP
,
respectively.
So we can serialize Lua tables using CBOR.
But the issue is we really can't use the CBOR ARRAY
encoding for an arbitrary Lua table,
because we can't be sure if it's just a simple array:
my_array = { 'alpha' , 'beta' , 'gamma' , 'delta' , 'epsilon' }
or a simple hash table:
my_hash = { alpha = 1 , beta = 2 , gamme = 3 , delta = 4 , epsilon = 5 , pi = 3.1415926 }
or both! (Well, we could but it would be computationally expensive)
No, the best way is to handle tables as a CBOR MAP
.
Sure,
it wastes a bit of space including the index if the table is a simple array,
but we simplify the code,
and can deal with tables like my_table
.
There's one more issue we have to deal with—circular references:
x = {} x.a = "A" -- this is a shorthand notation for x['a'] = "A" x.b = { 1 , true , false , "hello" } -- a mixed array x.c = x -- we refer to ourselves x.d = x.b -- more than once
A naïve serializer will get caught in an infinite loop the second it tries to serialize x['c']
as x['c']
refers back to itself.
A smarter serializer will keep track and realize that x['c']
is also x
,
but now we come to the problem of encoding this in CBOR.
Fortunately, this is where the semantic information of CBOR comes into play. There's a standard extention to CBOR for references. This includes additional information into the output to designate the original value and later references to it. It does complicates the encoding and decoding process a bit (as I found out).
But so far, it was fairly easy.
And then there are functions …