Friday, March 01, 2013
Parsing—it's not just for compilers anymore
I've been playing around with LuaRocks and while I've made a rock of all my modules, I've been thinking that it would be better if I made the modules individual rocks. That way, you can install just the modules you want (perhaps you want to embed a C compiler in your Lua program) instead of a bunch of modules most of which you won't use.
And that's fine. But I like the ability to pull the source code right
out of the repository when making a rock. Now, given that the majority of
my modules are single files (either in Lua or C) and that it's difficult to
check out a single file with git (or with svn for that matter), I think I'd
be better served having each module be its own repository.
And that's fine, but now I have a larger problem—how do I break out the individual files into their own repositories and keep the existing revision history? This doesn't seem to be an easy problem to solve.
Sure, git now has the concept of “submodules” (external repositories
referenced in an existing repository), but that doesn't help me here (and
git's handling of “submodules” is quirky at best). There's
git-filter-branch, but that's if I want to break a directory into its own
repository, not a single file. But there's also git-fast-export, which dumps
an existing repository in a text format, supposedly to help export
repositories into other version control systems.
I think I can work with this.
The resulting output is simple and easy to parse, so my thought is to look
only at the bits involving the file I'm interested in and generate a new
file that can then be imported into a fresh repository with
git-fast-import.
I used LPeg to parse the exported output (why not? The git export format is
documented with BNF, which is directly translatable into LPeg), and the only
difficult portion was handling this bit of syntax:
'data' SP <count> LF <raw> LF?
A datablock contains the number of bytes to read starting with the next line. Defining this in LPeg took some thinking. An early approach was something like:
data = Ct(                             -- return parse results in table
         P'data '                      -- match 'data' SP
         * Cg(R"09"^1,'size')          -- get size, save for later reference
         * P'\n'                       -- match LF
         * Cg(                         -- named capture
               P(tonumber(Cb('size'))) -- of 'size' bytes
               ,'data'                 -- store as 'data'
             )
         * P'\n'^-1                    -- parse optional LF
       )
The documentation for lpeg.P(n) states that it matches exactly n characters,
but in my case, n wasn't constant. You can do named captures, so I figured I
could capture the size, then retrieve it by name, passing the value to
lpeg.P(), but no, that didn't work. It generates “bad argument #1 to 'P'
(lpeg-pattern expected, got nil)”, in other words, an error.
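A small experiment helps show where the nil comes from. This is only a sketch, assuming LPeg is installed and loadable as "lpeg"; the variable name back is made up for illustration. lpeg.Cb() does not return the captured text. It returns a pattern object, built once while the grammar is being constructed, so tonumber() has nothing numeric to convert:

```lua
local lpeg = require "lpeg"

-- Cb() builds a back-capture *pattern*; no parsing has happened yet,
-- so there is no captured size to hand to tonumber().
local back = lpeg.Cb("size")

print(lpeg.type(back))   -- the value is an LPeg pattern object
print(tonumber(back))    -- nil, since a pattern is not a number

-- P(nil) is what raises the "lpeg-pattern expected, got nil" error;
-- pcall() is used here only so the script keeps running.
local ok, err = pcall(lpeg.P, tonumber(back))
print(ok, err)
```

In other words, the size is only known while matching, which is too late for anything computed while the pattern is being put together.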
It took quite a bit of playing around, and a close reading of the LPeg manual, before I found the solution:
function immdata(subject,position,capture)
  local size = tonumber(capture)          -- trailing LF is ignored
  local last = position + size - 1        -- index of the last data byte
  local data = subject:sub(position,last)
  return last + 1,data                    -- resume just past the data
end

data = Ct(
         P'data '
         * Cg(Cmt(R"09"^1 * P"\n",immdata),'data')
         * P'\n'^-1
       )
It's the lpeg.Cmt() that does it. It calls the given function as soon as the
given pattern is matched. The function is given the entire object being
parsed (one huge string, in this case the subject parameter), the position
after the match (the position parameter), and the actual string that was
matched (the capture parameter). From there, we can parse the size
(tonumber(), a standard Lua function, ignores the included line feed
character), then we return the new position where LPeg should resume parsing
and what we want as the capture (the variable amount of data).
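To see the match-time capture at work, here is a self-contained sketch run against a made-up data block. It assumes LPeg is installed; the sample string is invented for illustration, and the check for truncated input is an addition of this sketch, not something the description above requires:

```lua
local lpeg = require "lpeg"
local P, R, Cg, Ct, Cmt = lpeg.P, lpeg.R, lpeg.Cg, lpeg.Ct, lpeg.Cmt

-- Called at match time with the whole subject, the position just after
-- "<count>\n", and the matched "<count>\n" text itself.
local function immdata(subject, position, capture)
  local size = tonumber(capture)             -- "12\n" converts to 12
  local last = position + size - 1           -- index of the last data byte
  if last > #subject then return false end   -- truncated input: fail the match
  return last + 1, subject:sub(position, last)
end

local data = Ct(
    P"data "
  * Cg(Cmt(R"09"^1 * P"\n", immdata), "data")
  * P"\n"^-1                                 -- optional trailing LF
)

-- Hypothetical input: 12 bytes of payload followed by the optional LF.
local sample = "data 12\nhello\nworld\n\n"
local result = data:match(sample)
print(#result.data)                          --> 12
```

The payload here contains line feeds of its own, which is why the size has to be honored rather than scanning for the next newline.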
And this was the hardest part of the entire project, trying to
match a variable number of unknown characters. Once I had this, I could
read the exported repository into memory, find the parts relating to an
individual file and generate output that had the history of that one file
(excluding the bits where the file may have moved from directory to
directory; those weren't needed), which could then be imported into a clean
git repository.
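As for finding the parts relating to an individual file, one plausible starting point is the filemodify command, which the git-fast-import documentation gives as 'M' SP <mode> SP <dataref> SP <path> LF. The sketch below is not the code I actually used; the names filemodify and touches are invented for illustration, and it assumes unquoted paths (the real format quotes paths containing unusual characters):

```lua
local lpeg = require "lpeg"
local P, R, Cg, Ct = lpeg.P, lpeg.R, lpeg.Cg, lpeg.Ct

local SP, LF = P" ", P"\n"

-- 'M' SP <mode> SP <dataref> SP <path> LF
local filemodify = Ct(
    P"M" * SP
  * Cg(R"07"^1, "mode") * SP             -- octal file mode, e.g. 100644
  * Cg((P(1) - SP)^1, "dataref") * SP    -- a mark such as :12, or a hash
  * Cg((P(1) - LF)^1, "path") * LF       -- rest of the line (unquoted path)
)

-- Hypothetical helper: does this line modify the file we care about?
local function touches(line, wanted)
  local m = filemodify:match(line)
  return m ~= nil and m.path == wanted
end

print(touches("M 100644 :12 src/foo.lua\n", "src/foo.lua"))
print(touches("M 100644 :13 src/bar.c\n", "src/foo.lua"))
```

A commit with no matching filemodify lines has nothing to contribute to the new repository, while a matching one points (through its dataref) at the blob whose data block needs to be carried over.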