The Boston Diaries

Saturday, September 05, 2015

Some impressions of DynASM

I'm curious to test something and, odd as it may seem, the best way to do this (in my opinion) was to try using DynASM, the dynamic assembler used by LuaJIT (it's part of LuaJIT, but it can be used separately). The official documentation is somewhat lacking, so I've been following a tutorial (along with the tutorial source code) for my own little project.

I will not be re-covering that ground here (that, and The Unofficial DynASM Documentation, should be enough to get you through using it if you are interested) but I will give a brief overview and my impressions of it.

DynASM is used to generate code, specified as assembly, at runtime, not at compile time. As such, you give the code you want to compile in your program thusly:

  if (token.type == TOKEN_NUMBER)
    | mov ax,token.value
  else if (token.type == TOKEN_VARIABLE)
    | mov ax,[g_vars + token.value]

All this code does is generate different code depending on whether the given token is a number or a variable. The DynASM statements themselves start with a “|” (which can lead to issues if you aren't expecting it) and in this case, it's the actual assembly code we want (more assembly code can be specified, but it's limited to one assembly statement per line). Once we have written our program, the C code needs to be run through a preprocessor (the actual DynASM program itself, written in Lua), which replaces each DynASM statement with the C call that emits the proper machine code:

  if (token.type == TOKEN_NUMBER)
    //| mov ax,token.value
    dasm_put(Dst, 3, token.value);
#line 273 "calc.dasc"
  else if (token.type == TOKEN_VARIABLE)
    //| mov ax,[g_vars + token.value]
    dasm_put(Dst, 7, g_vars + token.value);

The DynASM state data, in this case Dst, can be specified with other DynASM directives in the code. It's rather configurable. You then link against the proper runtime code (there are versions for x86, ARM, PowerPC and MIPS) and add some boilerplate code (this is just an example of such code) and there you go.

It's an intriguing approach, and the ability to specify normal looking assembly code is a definite plus. That you have to supply different code for different CPUs is … annoying but understandable (you can get around some of this with judicious use of macros and defines but there's only so much you can hide when at one extreme, you have a CPU with only eight registers and strict memory ordering and at the other end, CPUs with 32 registers and not-so-strict memory ordering). The other thing that really bites is the use of the “|” to denote DynASM statements. Yes, it can be worked around, but why couldn't Mike Pall (author of LuaJIT) have picked a symbol not used by C for this, like “@” or “$”? Unfortunately, it is what it is.

Overall, it's rather fun to play with, and it was pretty easy to use, spotty documentation notwithstanding.

Of course it's slower, but I didn't expect it to be quite that bad

Time for another useless µbenchmark! This time, the overhead of trapping integer overflow!

So, inspired by this post about trapping integer overflow, I thought it might be interesting to see how bad the overhead is of using the x86 instruction INTO to catch integer overflow. To do this, I'm using DynASM to generate code from an expression that uses INTO after every operation. There are other ways of doing this, but the simplest way is to use INTO. I'm also using 16-bit operations, as the numbers involved (between -32,768 and 32,767) are reasonable (for a human) to deal with (unlike the 32-bit range of -2,147,483,648 to 2,147,483,647 or the insane 64-bit range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).

The one surprising result was that Linux treats the INTO trap as a segfault! Even requesting additional information (passing the SA_SIGINFO flag with sigaction()) doesn't tell you anything. But that in itself tells you it's not a real segfault, as a real segfault will report a memory mapping error. Personally, I would have expected a floating point fault, even though it's not a floating point operation, because on Linux, integer division by 0 results in a floating point fault (and oddly enough, a floating point division by 0 results in ∞ but no fault)!
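For reference, here is a minimal sketch of asking for that additional information with sigaction() and SA_SIGINFO. The names on_fault, install_fault_handler, last_signo and last_code are made up for illustration; this is not the code used for the tests above, just one plausible shape for it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical globals: what the handler saw last. */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;

/* Three-argument handler form, selected by SA_SIGINFO. For a genuine bad
   memory access, info->si_code is SEGV_MAPERR or SEGV_ACCERR. */
static void on_fault(int signo, siginfo_t *info, void *context)
{
  (void)context;
  last_signo = signo;
  last_code  = info->si_code;
}

/* Install on_fault for the given signal; returns 0 on success. */
static int install_fault_handler(int signo)
{
  struct sigaction sa;

  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = on_fault;
  sa.sa_flags     = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  return sigaction(signo, &sa, NULL);
}
```

Calling install_fault_handler(SIGSEGV) and then raise(SIGSEGV) is a safe way to watch the handler run. Note that simply returning from a handler like this after a real hardware fault generally just re-runs the faulting instruction, so real code would need to exit or jump elsewhere (siglongjmp(), for instance).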

But, aside from that, some results. I basically run the expression one million times and simply record how long it takes. The first is just setting a variable to a fixed value (and the “- 0” bit is there just to ensure an overflow check is included):

x = 1 - 0
overflow	time	expression result
true	0.009080000	1
false	0.006820000	1

Okay, not terribly bad. But how about a longer expression? (and remember, the expression isn't optimized)

x = 1 + 1 + 1 + 1 + 1 + 1 * 100 / 13
overflow	time	expression result
true	0.079528000	46
false	0.030125000	46

Yikes! (But this is also including the function call overhead). For the curious, the last example compiled down to:

	xor	eax,eax
	mov	ax,1
	add	ax,1
	into
	add	ax,1
	into
	add	ax,1
	into
	add	ax,1
	into
	add	ax,1
	into
	imul	100
	into
	mov	bx,13
	cwd
	idiv	bx
	into
	mov	[$0804f50E],ax
	ret

The non-overflow version just had the INTO instructions missing—otherwise it was the same code.
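To make the semantics of that listing concrete, here is a rough C model of it. This is an assumption-laden sketch, not what DynASM generates: the helpers narrow16, add16, mul16, div16 and eval_example are hypothetical, and instead of testing the CPU's overflow flag they do the arithmetic in 32 bits and check that the result still fits in 16 bits. Note that the expression is evaluated strictly left to right, as in the listing, which is why the result is 46 rather than what C's operator precedence would give.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store wide in *out if it fits in a signed 16-bit value; false on overflow. */
static bool narrow16(int32_t wide, int16_t *out)
{
  if (wide < INT16_MIN || wide > INT16_MAX)
    return false;
  *out = (int16_t)wide;
  return true;
}

static bool add16(int16_t a, int16_t b, int16_t *out)
{
  return narrow16((int32_t)a + b, out);
}

static bool mul16(int16_t a, int16_t b, int16_t *out)
{
  return narrow16((int32_t)a * b, out);
}

static bool div16(int16_t a, int16_t b, int16_t *out)
{
  if (b == 0)
    return false;                        /* division by zero is rejected too */
  return narrow16((int32_t)a / b, out);  /* -32768 / -1 does not fit */
}

/* x = 1 + 1 + 1 + 1 + 1 + 1 * 100 / 13, left to right, checked at each step. */
static bool eval_example(int16_t *x)
{
  int16_t acc = 1;

  for (int i = 0; i < 5; i++)
    if (!add16(acc, 1, &acc))
      return false;
  if (!mul16(acc, 100, &acc))
    return false;
  return div16(acc, 13, x);
}
```

With this model, eval_example() succeeds and stores 46, while something like add16(32767, 1, &y) reports an overflow instead of wrapping around.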

What surprises me most here is that the INTO instruction just checks the overflow flag and causes a trap only if it's set. The timings I have (and I'll admit, the figures I have are old and for the 80486) show that INTO only has a three-cycle overhead if not taken. I'm guessing things are worse with the newer multipipelined multiscalar multiprocessor monstrosities we use these days.

Next I'll have to try using the JO instruction and see how well that fares.
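For anyone wanting to reproduce timings like the ones above, the measurement (call the generated function one million times and record the elapsed time) could be sketched along these lines. The names time_calls, expr_fn and sample_expr are hypothetical, and clock_gettime() with CLOCK_MONOTONIC is an assumption; the post doesn't say which timer was actually used.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical type for a compiled expression: no arguments, no result. */
typedef void (*expr_fn)(void);

/* Call fn the given number of times and return the elapsed time in seconds.
   The function call overhead is included in the result. */
static double time_calls(expr_fn fn, long iterations)
{
  struct timespec start;
  struct timespec end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (long i = 0; i < iterations; i++)
    fn();
  clock_gettime(CLOCK_MONOTONIC, &end);

  return (double)(end.tv_sec - start.tv_sec)
       + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Stand-in for generated code: x = 1 - 0, stored to a global. */
static volatile short x;

static void sample_expr(void)
{
  x = 1 - 0;
}
```

Something like time_calls(sample_expr, 1000000L) then gives a figure comparable in spirit to the tables above, though the absolute numbers will depend entirely on the machine.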

When an issue requires a clone to resolve, I think the bus number is a bit too low

Over a month ago, Mike Pall announced he was leaving the LuaJIT project, which is sad, considering that LuaJIT has been pretty much his project. But he does have a sense of humor about leaving the project:

Suggested by http://www.freelists.org/post/luajit/Luajit-30-plan,2

Unclear how to find the upstream tracker to push this issue to. Will send pull request to upstream with DNA/RNA sequence, once this has been resolved.

Long-term enhancement request. No milestone assigned, yet.

Clone Mike Pall • Issue #45 • LuaJIT/LuaJIT
