## Friday, March 27, 2015

### Some musings on serializing Lua with CBOR

Because it seemed like a fun problem, I decided to try my hand at writing a serialization library for Lua. It's not like there are many to choose from or anything. And besides, how long could it take?

[Many, many hours later—too many to mention.]

Well,
*that* was *fun*.

I decided to use Concise Binary Object Representation as the binary format. The specification is straightforward, and surprising for a binary format, you can include quite a bit of semantic information.

Now, Lua itself has eight basic types; a few have subtypes:

- nil
- boolean
- number
- integer
- real

- string
- table
- function
- userdata
- light userdata
- full userdata

- thread

The *nil* type represents nothing—i.e. the absense of a value.
CBOR has an encoding for this,
so this is easy to handle.
The *boolean* type has just two values—`true`

and `false`

,
and again,
CBOR has an encoding for these two values as well,
so no problem there.
Lua's *number* type has two subtypes *integer* and *real*
(for Lua 5.3 and above—previous versions of Lua only had one numeric type,
*real* as that could handle integer values up to 9,007,199,254,740,989,
way larger than the 32-bit value of 4,294,967,296,
but the rise of 64-bit systems required Lua to adapt to handle integer values up to 9,223,372,036,854,775,807).
Again, CBOR has encodings for both integers and reals.
And support for the *string* type is also easy.
Even better, CBOR has *two* formats for strings,
one for text,
and another format for arbitrary binary data.

And then there's the *table.*

In Lua, tables are actually hash tables.
There's an internal optimization if a table is treated as a real array,
where the indices are consecutive integers starting from 1.
Then the items are actually *stored* as an array.
Even worse
(or better,
depending on your viewpoint),
a Lua table can be used as both at the same time.

For example:

my_table = { 'alpha', 'beta', 'gamma', 'delta', 'epsilon', alpha = 1, beta = 2, gamma = 3, delta = 4, epsilon = 5, pi = 3.1415926 }

The strings “alpha”, “beta”, “gamma”, “delta”, and “epsilon” are stored in the array portion,
and if you take the length of `my_table`

,
you'll get back 5. So while you can reference `my_table[3]`

(which is “gamma”)
you can also reference `my_table['pi']`

(which is 3.1415926).

Now CBOR has encodings for arrays and hash tables,
an `ARRAY`

and a `MAP`

,
respectively.
So we can serialize Lua tables using CBOR.
But the issue is we really can't use the CBOR `ARRAY`

encoding for an arbitrary Lua table,
because we can't be sure if it's just a simple array:

my_array = { 'alpha' , 'beta' , 'gamma' , 'delta' , 'epsilon' }

or a simple hash table:

my_hash = { alpha = 1 , beta = 2 , gamme = 3 , delta = 4 , epsilon = 5 , pi = 3.1415926 }

or both! (Well, we *could* but it would be computationally expensive)

No, the best way is to handle tables as a CBOR `MAP`

.
Sure,
it wastes a bit of space including the index if the table is a simple array,
but we simplify the code,
and can deal with tables like `my_table`

.

There's one more issue we have to deal with—circular references:

x = {} x.a = "A" -- this is a shorthand notation for x['a'] = "A" x.b = { 1 , true , false , "hello" } -- a mixed array x.c = x -- we refer to ourselves x.d = x.b -- more than once

A naïve serializer will get caught in an infinite loop the second it tries to serialize `x['c']`

as `x['c']`

refers back to itself.
A smarter serializer will keep track and realize that `x['c']`

is also `x`

,
but now we come to the problem of encoding this in CBOR.

Fortunately, this is where the semantic information of CBOR comes into play. There's a standard extention to CBOR for references. This includes additional information into the output to designate the original value and later references to it. It does complicates the encoding and decoding process a bit (as I found out).

But so far, it was fairly easy.

And *then* there are functions …