The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Friday, March 01, 2013

Parsing—it's not just for compilers anymore

I've been playing around with LuaRocks and while I've made a rock of all my modules, I've been thinking that it would be better if I made the modules individual rocks. That way, you can install just the modules you want (perhaps you want to embed a C compiler in your Lua program) instead of a bunch of modules most of which you won't use.

And that's fine. But I like the ability to pull the source code right out of the repository when making a rock. Now, given that the majority of my modules are single files (either in Lua or C) and the fact that it's difficult to check out a single file with git (or with svn for that matter), I think I'd be better served having each module be its own repository.

And that's fine, but now I have a larger problem—how do I break out the individual files into their own repositories and keep the existing revision history? This doesn't seem to be an easy problem to solve.

Sure, git now has the concept of “submodules”—external repositories referenced in an existing repository, but that doesn't help me here (and git's handling of “submodules” is quirky at best). There's git-filter-branch but that's if I want to break a directory into its own repository, not a single file. But there's also git-fast-export, which dumps an existing repository in a text format, supposedly to help export repositories into other version control systems.

I think I can work with this.

The resulting output is simple and easy to parse, so my thought is to look only at the bits involving the file I'm interested in and generate a new file that can then be imported into a fresh repository with git-fast-import.

I used LPeg to parse the exported output (why not? The git export format is documented with BNF, which is directly translatable into LPeg), and the only difficult portion was handling this bit of syntax:

'data' SP <count> LF
<raw> LF?

A datablock contains the number of bytes to read starting with the next line. Defining this in LPeg took some thinking. An early approach was something like:

data = Ct(				-- return parse results in table
	   P'data '			-- match 'data' SP
	   * Cg(R"09"^1,'size')		-- get size, save for later reference
	   * P'\n'			-- match LF
	   * Cg(			-- named capture
	         P(tonumber(Cb('size'))) -- of 'size' bytes
		 ,'data'                -- store as 'data'
	       )
	   * P'\n'^-1			-- parse optional LF
	 )

lpeg.P(n) states that it matches n characters, but in my case, n wasn't constant. You can do named captures, so I figured I could capture the size, then retrieve it by name, passing the value to lpeg.P(), but no, that didn't work. It generates “bad argument #1 to 'P' (lpeg-pattern expected, got nil)” (in other words, an error).
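A minimal sketch of why that call fails, assuming LPeg is installed and loadable as lpeg (the variable names are made up for illustration): lpeg.Cb() returns a pattern object rather than the captured text, and the whole expression is evaluated while the grammar is being built, before any input has been read.

```lua
local lpeg = require "lpeg"

-- Cb() builds a pattern that fetches the capture during a match;
-- it does not hold the captured text itself.
local back = lpeg.Cb("size")
print(lpeg.type(back))        -- prints pattern

-- tonumber() cannot convert a pattern object, so it yields nil ...
local n = tonumber(back)
print(n)                      -- prints nil

-- ... and lpeg.P(nil) raises an error while the grammar is being built.
local ok, err = pcall(lpeg.P, n)
print(ok)                     -- prints false
print(err)                    -- the error message from lpeg.P
```

So the size has to be read while the match is running, not while the pattern is being constructed.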

It took quite a bit of playing around, and close reading of the LPeg manual before I found the solution:

function immdata(subject,position,capture)
  local size  = tonumber(capture)            -- tonumber() ignores the trailing LF
  local range = position + size - 1          -- index of the last byte of data
  local data  = subject:sub(position,range)
  return range + 1,data                      -- resume just past the data
end

data = Ct(
	   P'data '
	   * Cg(Cmt(R"09"^1 * P"\n",immdata),'data')
	   * P'\n'^-1
	 )

It's the lpeg.Cmt() that does it. It calls the given function as soon as the given pattern is matched. The function is given the entire object being parsed (one huge string, in this case the subject parameter), the position after the match (the position parameter), and the actual string that was matched (the capture parameter). From there, we can parse the size (tonumber(), a standard Lua function, ignores the included line feed character), then we return what we want as the capture (the variable amount of data) and the new position where LPeg should resume parsing.
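For anyone who wants to try the idea, here is a self-contained sketch, assuming LPeg is installed and loadable as lpeg. It is a variation and not necessarily the exact code used for this project: the function name take_bytes is made up, the digits are captured explicitly with lpeg.C() so the function receives only the number, the function returns the position just past the raw bytes, and it fails the match if the input is shorter than the stated size.

```lua
local lpeg = require "lpeg"
local C, Cg, Cmt, Ct, P, R = lpeg.C, lpeg.Cg, lpeg.Cmt, lpeg.Ct, lpeg.P, lpeg.R

-- Match-time function: called once the size line has matched.
-- `position` is the index of the first byte after the LF.
local function take_bytes(subject, position, digits)
  local size = tonumber(digits)
  local stop = position + size              -- first byte past the raw data
  if stop > #subject + 1 then
    return false                            -- input too short: fail the match
  end
  return stop, subject:sub(position, stop - 1)
end

local data = Ct(
    P"data "
  * Cg(Cmt(C(R"09"^1) * P"\n", take_bytes), "data")
  * P"\n"^-1                                -- optional LF after the raw data
)

local one = data:match("data 5\nhello\n")
print(one.data)                             -- prints hello

-- The raw bytes may contain line feeds; only the count matters.
local two = Ct(data * data):match("data 4\na\nb\ndata 2\nhi")
print(#two[1].data, two[2].data)            -- prints 4 and hi
```

The second match shows why counting bytes matters: the first block's contents include two line feeds, so scanning for the next LF would cut the data short.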

And this was the hardest part of the entire project, trying to match a variable number of unknown characters. Once I had this, I could read the exported repository into memory, find the parts relating to an individual file and generate output that had the history of that one file (excluding the bits where the file may have moved from directory to directory, since those weren't needed) which could then be imported into a clean git repository.
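The filtering step can be pictured with a small sketch in plain Lua. Everything in it is hypothetical: the table layout, the field names and the file names are invented for illustration and are not the data structures actually used, and matching on the file name alone is just one assumed way to ignore directory moves.

```lua
-- Hypothetical parsed form of an export: one table per commit, each
-- listing the paths it modified. Field names are illustrative only.
local commits = {
  { id = 1, message = "initial import", files = { "src/crc.c", "src/tcc.c" } },
  { id = 2, message = "fix crc table",  files = { "src/crc.c" } },
  { id = 3, message = "move modules",   files = { "lib/tcc.c" } },
  { id = 4, message = "update docs",    files = { "README" } },
}

-- Return the final component of a path ("lib/tcc.c" -> "tcc.c").
local function basename(path)
  return path:match("([^/]+)$")
end

-- Keep, in order, only the commits that touch a file with the given
-- name, ignoring which directory it lived in at the time.
local function history_of(list, name)
  local kept = {}
  for _, commit in ipairs(list) do
    for _, path in ipairs(commit.files) do
      if basename(path) == name then
        kept[#kept + 1] = commit
        break
      end
    end
  end
  return kept
end

local kept = history_of(commits, "tcc.c")
for _, commit in ipairs(kept) do
  print(commit.id, commit.message)
end
```

In this made-up history only commits 1 and 3 survive, which is the subset that would be written back out for git-fast-import.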

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.