Parsing—it's not just for compilers anymore

Friday, March 01, 2013

I've been playing around with LuaRocks and while I've made a rock of all my modules, I've been thinking that it would be better if I made the modules individual rocks. That way, you can install just the modules you want (perhaps you want to embed a C compiler in your Lua program) instead of a bunch of modules most of which you won't use.

And that's fine. But I like the ability to pull the source code right out of the repository when making a rock. Now, given that the majority of my modules are single files (either in Lua or C), and that it's difficult to check out a single file with git (or with svn, for that matter), I think I'd be better served having each module be its own repository.

And that's fine, but now I have a larger problem—how do I break out the individual files into their own repositories and keep the existing revision history? This doesn't seem to be an easy problem to solve.

Sure, git now has the concept of “submodules”—external repositories referenced in an existing repository, but that doesn't help me here (and git's handling of “submodules” is quirky at best). There's git-filter-branch but that's if I want to break a directory into its own repository, not a single file. But there's also git-fast-export, which dumps an existing repository in a text format, supposedly to help export repositories into other version control systems.

I think I can work with this.

The resulting output is simple and easy to parse, so my thought is to look only at the bits involving the file I'm interested in and generate a new file that can then be imported into a fresh repository with git-fast-import.

I used LPeg to parse the exported output (why not? The git export format is documented with BNF, which is directly translatable into LPeg), and the only difficult portion was handling this bit of syntax:

'data' SP <count> LF
<raw> LF?

A datablock contains the number of bytes to read starting with the next line. Defining this in LPeg took some thinking. An early approach was something like:

data = Ct(				-- return parse results in table
	   P'data '			-- match 'data' SP
	   * Cg(R"09"^1,'size')		-- get size, save for later reference
	   * P'\n'			-- match LF
	   * Cg(			-- named capture
	         P(tonumber(Cb('size'))) -- of 'size' characters
		 ,'data'                -- store as 'data'
	     )
	   * P'\n'^-1			-- parse optional LF
	)

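lpeg.P(n) matches exactly n characters, but in my case, n wasn't constant. You can do named captures, so I figured I could capture the size, then retrieve it by name and pass the value to lpeg.P(). But no, that didn't work; it fails with “bad argument #1 to 'P' (lpeg-pattern expected, got nil)”. The catch is that lpeg.Cb('size') doesn't return the captured text. It returns another pattern, built while the grammar is being defined and long before anything has been matched, so tonumber() hands back nil and lpeg.P() rejects it.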
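The failure can be reproduced in isolation. This is a minimal sketch, assuming LPeg is installed and loadable as "lpeg"; the variable names are made up for illustration, and the exact wording of the error may differ between LPeg versions.

```lua
local lpeg = require "lpeg"

-- Cb('size') builds a pattern object immediately; the captured text
-- only exists later, while a match is actually running.
local back = lpeg.Cb('size')
print(lpeg.type(back))   -- "pattern"
print(tonumber(back))    -- nil, since a pattern is not a numeric string

-- Passing that nil on to lpeg.P() raises an error at definition time.
local ok, err = pcall(function() return lpeg.P(tonumber(back)) end)
print(ok, err)
```

Nothing here ever looks at a subject string, which is the point: the size has to be read during the match, not while the pattern is being put together.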

It took quite a bit of playing around, and a close reading of the LPeg manual, before I found the solution:

function immdata(subject,position,capture)
  local size  = tonumber(capture)		-- "123\n" -> 123
  local range = position + size - 1		-- last byte of the data
  local data  = subject:sub(position,range)
  return range + 1,data				-- resume just past the data
end

data = Ct(
	   P'data '
	   * Cg(Cmt(R"09"^1 * P"\n",immdata),'data')
	   * P'\n'^-1
	)

It's the lpeg.Cmt() that does it. It calls the given function as soon as the given pattern is matched, while the match is still in progress. The function is given the entire object being parsed (one huge string, in this case the subject parameter), the position just after the match (the position parameter), and the actual string that was matched (the capture parameter). From there, we can parse the size (tonumber(), a standard Lua function, ignores the trailing line feed character), then return the new position where LPeg should resume parsing, followed by what we want as the capture (the variable amount of data).
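Here is a minimal sketch of that pattern run against a hand-made data block. The sample string, the name datablock, and the trailing lpeg.Cp() are assumptions added for illustration and do not come from a real export. This immdata() returns the position just past the payload, since LPeg treats the returned number as the new current position, and it fails the match if the declared size runs off the end of the input.

```lua
local lpeg = require "lpeg"
local P, R, Cg, Ct, Cmt, Cp = lpeg.P, lpeg.R, lpeg.Cg, lpeg.Ct, lpeg.Cmt, lpeg.Cp

-- Called by Cmt during the match: 'capture' is the matched "<count>\n"
-- and 'position' is the index of the first payload byte.
local function immdata(subject, position, capture)
  local size = tonumber(capture)
  local stop = position + size               -- first byte after the payload
  if stop > #subject + 1 then return false end  -- truncated input: fail
  return stop, subject:sub(position, stop - 1)
end

local datablock = Ct(
    P'data '
  * Cg(Cmt(R"09"^1 * P'\n', immdata), 'data')
  * P'\n'^-1
) * Cp()                                     -- also report where parsing stopped

-- Made-up sample: a 12-byte payload, the optional LF, then the next command.
local sample = "data 12\nhello\nworld\n\nreset refs/heads/master\n"
local result, next_pos = datablock:match(sample)

print(#result.data)                -- 12
print(sample:sub(next_pos))        -- the text of the following command
```

The payload itself contains line feeds, which is why no line-oriented pattern can find its end; only the byte count can.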

And this was the hardest part of the entire project: trying to match a variable number of unknown characters. Once I had this, I could read the exported repository into memory, find the parts relating to an individual file, and generate output that had the history of that one file (excluding the bits where the file may have moved from directory to directory; those weren't needed), which could then be imported into a clean git repository.
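As a rough idea of what "find the parts relating to an individual file" can look like, here is a sketch that checks whether one line of a commit touches a given path. It is not the code used for the conversion. It assumes the file-change lines take the forms 'M' SP <mode> SP <dataref> SP <path> LF and 'D' SP <path> LF, and that paths are unquoted; touches() and the sample lines are hypothetical names and data made up for illustration.

```lua
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local mode    = R"07"^1                  -- octal file mode, e.g. 100644
local dataref = (P(1) - P' ')^1          -- a mark such as ":12", or "inline"
local path    = C((P(1) - P'\n')^1)      -- assumption: unquoted path
local modify  = P'M ' * mode * P' ' * dataref * P' ' * path * P'\n'
local delete  = P'D ' * path * P'\n'
local change  = modify + delete

-- True when 'line' is a modify or delete command for exactly 'wanted'.
local function touches(line, wanted)
  return change:match(line) == wanted
end

print(touches("M 100644 :12 src/foo.lua\n", "src/foo.lua"))  -- true
print(touches("M 100644 :13 README\n",      "src/foo.lua"))  -- false
print(touches("D src/foo.lua\n",            "src/foo.lua"))  -- true
```

A real filter would also have to keep the blobs those commits refer to and drop everything else, which this sketch leaves out.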

The Boston Diaries
