Of course it's slower, but I didn't expect it to be quite that bad

Saturday, September 05, 2015

Time for another useless µbenchmark! This time, the overhead of trapping integer overflow!

So, inspired by this post about trapping integer overflow, I thought it might be interesting to see how bad the overhead is of using the x86 instruction INTO to catch integer overflow. To do this, I'm using DynASM to generate code from an expression that uses INTO after every operation. There are other ways of doing this, but the simplist way is to use INTO. I'm also using 16-bit operations, as the numbers involved (between -32,768 and 32,767) are reasonable (for a human) to deal with (unlike the 32-bit range -2,147,483,648 to 2147483647 or the insane 64-bit range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).

The one surprising result was that Linux treats the INTO trap as a segfault! Even requesting additional information (passing the SA_SIGINFO flag with sigaction()) doesn't tell you anything. But that in itself tells you it's not a real segfault, as a real segfault will report a memory mapping error. Personally, I would have expected a floating point fault, even though it's not a floating point operation, because on Linux, integer division by 0 results in floating point fault (and oddly enough, a floating point division by 0 results in ∞ but no fault)!

But, aside from that, some results. I basically run the expression one million times and simply record how long it takes. The first is just setting a variable to a fixed value (and the “- 0” bit is there just to ensure an overflow check is included):

x = 1 - 0
overflow	time	expression result
true	0.009080000	1
false	0.006820000	1

Okay, not terribly bad. But how about a longer expression? (and remember, the expresssion isn't optimized)

x = 1 + 1 + 1 + 1 + 1 + 1 * 100 / 13
overflow	time	expression result
true	0.079528000	46
false	0.030125000	46

Yikes! (But this is also including the function call overhead). For the curious, the last example compiled down to:

	xor	eax,eax
	mov	ax,1
	add	ax,1
	into
	add	ax,1
	into
	add	ax,1
	into
	add	ax,1
	into
	add	ax,1
	into
	imul	100
	into
	mov	bx,13
	cwd
	idiv	bx
	into
	mov	[$0804f50E],ax
	ret

The non-overflow version just had the INTO instructions missing—otherwise it was the same code.

I think what's surprising the most here is that the INTO instruction just checks the overflow flag and only if set does it cause a trap. The timings I have (and I'll admit, the figures I have are old and for the 80486) show that INTO only has a three-cycle overhead if not taken. I'm guessing things are worse with the newer multipipelined multiscalar multiprocessor monstrosities we use these days.

Next I'll have to try using the JO instruction and see how well that fares.

The Boston Diaries

Saturday, September 05, 2015

Of course it's slower, but I didn't expect it to be quite that bad

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer