Saturday, September 05, 2015
Of course it's slower, but I didn't expect it to be quite that bad
Time for another useless µbenchmark! This time, the overhead of trapping integer overflow!
So, inspired by this post about trapping integer overflow, I thought it might be interesting to see how bad the overhead is of using the x86 instruction INTO to catch integer overflow. To do this, I'm using DynASM to generate code from an expression that uses INTO after every operation. There are other ways of doing this, but the simplest way is to use INTO.
I'm also using 16-bit operations, as the numbers involved (between -32,768 and 32,767) are reasonable (for a human) to deal with (unlike the 32-bit range of -2,147,483,648 to 2,147,483,647 or the insane 64-bit range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).
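For comparison, one of those other ways is to do the check in software. Here's a minimal sketch in C (not the DynASM code used for the benchmark): it widens each 16-bit operand to 32 bits, does the arithmetic, and reports whether the result falls outside the 16-bit range. The names `fit16`, `add16_checked` and `mul16_checked` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store a widened result in *out if it fits in 16 bits.
   Returns true on overflow (the case where INTO would trap). */
static bool fit16(int32_t wide, int16_t *out)
{
    if (wide < INT16_MIN || wide > INT16_MAX)
        return true;
    *out = (int16_t)wide;
    return false;
}

/* 16-bit addition with an explicit overflow check. */
bool add16_checked(int16_t a, int16_t b, int16_t *out)
{
    return fit16((int32_t)a + (int32_t)b, out);
}

/* 16-bit multiplication with an explicit overflow check. */
bool mul16_checked(int16_t a, int16_t b, int16_t *out)
{
    return fit16((int32_t)a * (int32_t)b, out);
}
```

With these, `add16_checked(32767, 1, &r)` reports an overflow, while `mul16_checked(6, 100, &r)` succeeds and leaves 600 in `r`.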
The one surprising result was that Linux treats the INTO trap as a segfault! Even requesting additional information (passing the SA_SIGINFO flag with sigaction()) doesn't tell you anything. But that in itself tells you it's not a real segfault, as a real segfault will report a memory mapping error. Personally, I would have expected a floating point fault, even though it's not a floating point operation, because on Linux, integer division by 0 results in a floating point fault (and oddly enough, a floating point division by 0 results in ∞ but no fault)!
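The division behaviour is easy to see for yourself. Here's a rough sketch in C, assuming Linux on x86: it installs a SIGFPE handler with sigaction() and SA_SIGINFO, then tries an integer division and a floating point division. Integer division by zero is undefined behaviour as far as C is concerned, so this leans on what the hardware and kernel happen to do; the names `int_div_traps`, `float_div_is_inf` and `fpe_code` are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <math.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf fpe_recover;          /* where to resume after SIGFPE */
static volatile sig_atomic_t fpe_code;  /* si_code reported with the signal */

static void on_sigfpe(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fpe_code = info->si_code;           /* FPE_INTDIV for integer divide by zero */
    siglongjmp(fpe_recover, 1);         /* skip past the faulting division */
}

/* Returns 1 if a / b raised SIGFPE, 0 if it completed normally. */
int int_div_traps(int a, int b)
{
    struct sigaction sa, old;
    volatile int va = a, vb = b;        /* volatile: force a real divide at run time */
    volatile int quotient;
    volatile int trapped = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, &old);

    if (sigsetjmp(fpe_recover, 1) == 0) {
        quotient = va / vb;
        (void)quotient;
    } else {
        trapped = 1;                    /* we got here via the signal handler */
    }

    sigaction(SIGFPE, &old, NULL);      /* restore the previous handler */
    return trapped;
}

/* Returns 1 if a / b is an infinity; no signal is raised by default. */
int float_div_is_inf(double a, double b)
{
    volatile double va = a, vb = b;
    return isinf(va / vb) != 0;
}
```

On x86 Linux, `int_div_traps(1, 0)` should return 1 with `fpe_code` set to `FPE_INTDIV`, while `float_div_is_inf(1.0, 0.0)` returns 1 without any signal being delivered.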
But, aside from that, some results. I basically run the expression one million times and simply record how long it takes. The first is just setting a variable to a fixed value (and the “- 0” bit is there just to ensure an overflow check is included):
overflow | time | expression result |
---|---|---|
true | 0.009080000 | 1 |
false | 0.006820000 | 1 |
Okay, not terribly bad. But how about a longer expression? (And remember, the expression isn't optimized.)
overflow | time | expression result |
---|---|---|
true | 0.079528000 | 46 |
false | 0.030125000 | 46 |
Yikes! (But this also includes the function call overhead.) For the curious, the last example compiled down to:
    xor eax,eax
    mov ax,1
    add ax,1
    into
    add ax,1
    into
    add ax,1
    into
    add ax,1
    into
    add ax,1
    into
    imul 100
    into
    mov bx,13
    cwd
    idiv bx
    into
    mov [$0804f50E],ax
    ret
The non-overflow version just had the INTO instructions missing; otherwise it was the same code.
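To make the measurement itself concrete, here's a rough sketch of the same idea in C: call a function that evaluates the long expression one million times and record the elapsed time, function call overhead included. This is not the DynASM-generated code that produced the numbers above. The names `expr_unchecked`, `time_runs` and `result` are made up, and the use of clock_gettime() is an assumption on my part for the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

static volatile int16_t result;   /* stands in for the final store to memory */

/* The long expression, with no overflow checks: ((1+1+1+1+1+1) * 100) / 13 */
void expr_unchecked(void)
{
    volatile int16_t ax = 1;      /* volatile keeps the compiler from folding it away */
    for (int i = 0; i < 5; i++)
        ax = (int16_t)(ax + 1);
    ax = (int16_t)(ax * 100);
    ax = (int16_t)(ax / 13);
    result = ax;
}

/* Call fn() `runs` times and return the elapsed wall-clock time in seconds. */
double time_runs(void (*fn)(void), long runs)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < runs; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `time_runs(expr_unchecked, 1000000L)` gives one timing; a checked variant of the expression would be timed the same way and the two compared.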
I think what surprises me most here is that the INTO instruction just checks the overflow flag, and only causes a trap if it's set. The timings I have (and I'll admit, the figures I have are old and for the 80486) show that INTO only has a three-cycle overhead if not taken. I'm guessing things are worse with the newer multipipelined multiscalar multiprocessor monstrosities we use these days.
Next I'll have to try using the JO instruction and see how well that fares.
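As a rough C-level analogue of that approach (an explicit branch on overflow instead of a trap), GCC and Clang provide the `__builtin_add_overflow` and `__builtin_mul_overflow` builtins. The sketch below assumes one of those compilers; whether the generated code actually ends up using JO is up to the compiler. The function name `expr_checked` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Evaluate ((start+1+1+1+1+1) * 100) / 13 in 16 bits, bailing out on overflow.
   Returns true if any step overflowed; otherwise stores the value in *out. */
bool expr_checked(int16_t start, int16_t *out)
{
    int16_t ax = start;

    for (int i = 0; i < 5; i++)
        if (__builtin_add_overflow(ax, (int16_t)1, &ax))
            return true;          /* an addition left the 16-bit range */

    if (__builtin_mul_overflow(ax, (int16_t)100, &ax))
        return true;              /* the multiplication left the 16-bit range */

    *out = (int16_t)(ax / 13);    /* dividing by 13 cannot overflow */
    return false;
}
```

Starting from 1, as in the listing above, `expr_checked(1, &r)` succeeds with 46 in `r`; starting from 32767 it reports an overflow on the first addition.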