The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Friday, March 27, 2015

Some musings on serializing Lua with CBOR

Because it seemed like a fun problem, I decided to try my hand at writing a serialization library for Lua. It's not like there are many to choose from or anything. And besides, how long could it take?

[Many, many hours later—too many to mention.]

Well, that was fun.

I decided to use Concise Binary Object Representation as the binary format. The specification is straightforward, and surprising for a binary format, you can include quite a bit of semantic information.

Now, Lua itself has eight basic types; a few have subtypes:

The nil type represents nothing—i.e. the absense of a value. CBOR has an encoding for this, so this is easy to handle. The boolean type has just two values—true and false, and again, CBOR has an encoding for these two values as well, so no problem there. Lua's number type has two subtypes integer and real (for Lua 5.3 and above—previous versions of Lua only had one numeric type, real as that could handle integer values up to 9,007,199,254,740,989, way larger than the 32-bit value of 4,294,967,296, but the rise of 64-bit systems required Lua to adapt to handle integer values up to 9,223,372,036,854,775,807). Again, CBOR has encodings for both integers and reals. And support for the string type is also easy. Even better, CBOR has two formats for strings, one for text, and another format for arbitrary binary data.

And then there's the table.

In Lua, tables are actually hash tables. There's an internal optimization if a table is treated as a real array, where the indices are consecutive integers starting from 1. Then the items are actually stored as an array. Even worse (or better, depending on your viewpoint), a Lua table can be used as both at the same time.

For example:

my_table = 
{
  'alpha',
  'beta',
  'gamma',
  'delta',
  'epsilon',
  alpha   = 1,
  beta    = 2,
  gamma   = 3,
  delta   = 4,
  epsilon = 5,
  pi      = 3.1415926
}

The strings “alpha”, “beta”, “gamma”, “delta”, and “epsilon” are stored in the array portion, and if you take the length of my_table, you'll get back 5. So while you can reference my_table[3] (which is “gamma”) you can also reference my_table['pi'] (which is 3.1415926).

Now CBOR has encodings for arrays and hash tables, an ARRAY and a MAP, respectively. So we can serialize Lua tables using CBOR. But the issue is we really can't use the CBOR ARRAY encoding for an arbitrary Lua table, because we can't be sure if it's just a simple array:

my_array = { 'alpha' , 'beta' , 'gamma' , 'delta' , 'epsilon' }

or a simple hash table:

my_hash = 
{ 
  alpha   = 1 , 
  beta    = 2 , 
  gamme   = 3 , 
  delta   = 4 , 
  epsilon = 5 , 
  pi      = 3.1415926
}

or both! (Well, we could but it would be computationally expensive)

No, the best way is to handle tables as a CBOR MAP. Sure, it wastes a bit of space including the index if the table is a simple array, but we simplify the code, and can deal with tables like my_table.

There's one more issue we have to deal with—circular references:

x = {}
x.a = "A" -- this is a shorthand notation for x['a'] = "A"
x.b = { 1 , true , false , "hello" } -- a mixed array
x.c = x   -- we refer to ourselves
x.d = x.b -- more than once

A naïve serializer will get caught in an infinite loop the second it tries to serialize x['c'] as x['c'] refers back to itself. A smarter serializer will keep track and realize that x['c'] is also x, but now we come to the problem of encoding this in CBOR.

Fortunately, this is where the semantic information of CBOR comes into play. There's a standard extention to CBOR for references. This includes additional information into the output to designate the original value and later references to it. It does complicates the encoding and decoding process a bit (as I found out).

But so far, it was fairly easy.

And then there are functions …

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.