Sunday, May 24, 2015
The crazy things that were done to make games run fast in the 8-bit era
Most of the time in my simple graphics program is spent in this loop:
clrloop   ldd   ,u++        ; load 2 bytes from background, increment pointer to next 2 bytes
          std   ,x++        ; store them in current frame, increment pointer to next 2 bytes
          cmpu  #end        ; are we done yet?
          bne   clrloop     ; nope, keep going
Despite its name, it's not clearing memory. It's actually copying the background image to the current frame being drawn. I'm not showing the code before or after this as this post is really about this loop.
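In higher-level terms the loop is nothing more than this (a C sketch of the same operation; the function name, parameter names and FRAME_SIZE constant are mine, with the 1,024-byte frame size taken from later in this post):

#include <string.h>

#define FRAME_SIZE 1024   /* bytes per frame, per the rest of this post */

/* copy the saved background image over the frame being drawn;
   the assembly does this two bytes at a time */
void copy_background(unsigned char *frame, const unsigned char *background)
{
    memcpy(frame, background, FRAME_SIZE);
}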
As written, each iteration of this loop takes 24 clock cycles (or just “cycles”) to run, meaning this code effectively copies one byte every 12 cycles. I recalled reading, several years ago, about a crazy scheme for copying memory on the Motorola 6809 (the CPU used in the Color Computer) that involved using the stack register.
But before I get crazy, just how fast can I get the code to run?
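Where do those 24 cycles go? Taking the data sheet's 5 cycles for cmpu # and 3 for bne, and splitting the remaining 16 evenly between the two indexed instructions (treat the per-instruction counts as approximate):

          ldd   ,u++        8 cycles    ; the copy itself
          std   ,x++        8 cycles    ; ditto
          cmpu  #end        5 cycles    ; pure loop overhead
          bne   clrloop     3 cycles    ; pure loop overhead

A third of every iteration is loop overhead, and over a 1,024-byte frame (512 iterations) the whole copy runs around 12,288 cycles. Spreading that overhead over more bytes is the obvious first move.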
Unrolling the loop a bit:
clrloop   ldd   ,u++        ; load 2 bytes from background
          std   ,x++        ; store 2 bytes to current frame
          ldd   ,u++        ; repeat this seven more times
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          cmpu  #end        ; are we there yet?
          bne   clrloop     ; don't make me turn this CPU around!
and we get 8.5 cycles per byte. Unrolling it more isn't worth it, as the fastest we'll ever get is 8 cycles per byte (assuming we unroll the entire loop to copy all 1,024 bytes; but in doing so we'd use 2K of code, as ldd ,u++ and std ,x++ are both two-byte instructions, just to copy 1K of data).
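For the record, the arithmetic (using the same rough per-instruction figures as before):

          8 × (8 + 8) + 5 + 3 = 136 cycles for 16 bytes = 8.5 cycles per byte

and with the loop fully unrolled the cmpu/bne overhead disappears, leaving just the 8 cycles per byte of the ldd/std pairs.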
Can we do better?
In checking the timings of the various indexing modes, amazingly enough, adding a constant offset instead of just incrementing the pointers is faster. This:
clrloop   ldd   ,u          ; load 2 bytes from background
          std   ,x          ; store 2 bytes to current frame
          ldd   2,u         ; load 2 more from background past the previous data
          std   2,x         ; store them past the previous data
          ldd   4,u         ; and keep this up
          std   4,x
          ldd   6,u
          std   6,x
          ldd   8,u
          std   8,x
          ldd   10,u
          std   10,x
          ldd   12,u
          std   12,x
          ldd   14,u
          std   14,x
          leau  16,u        ; adjust pointers by 16 bytes
          leax  16,x        ; as that's how much we copied
          cmpu  #end        ; are we there yet?
          bne   clrloop     ; enough!
gets us down to 6.5 cycles per byte! But unrolling this any further won't buy us a thing, as once the offset passes 15 the instruction takes longer to execute because of additional instruction decoding (offsets from -16 to 15 fit directly in the index postbyte; anything larger doesn't). So this routine is pretty much it as far as a straightforward approach will take us. Not so bad though: almost twice as fast as the original four-instruction loop. But to go even faster, we have to get crazy and bring in the stack pointer.
Why the stack pointer? Because of four instructions: PSHS, PSHU, PULS, and PULU.
The first instruction can save a number of registers onto the stack. But looking at it another way: it's an instruction that can write up to 12 bytes into memory. The second instruction is similar, but instead of using the stack register, it uses the U register (it's the “user stack pointer”). The data written goes from higher addresses to lower addresses (because traditionally, stacks grow downward in memory). The last two instructions do the reverse, restoring a number of registers from memory, or, looked at another way, reading up to 12 bytes from memory into registers.
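The 12 bytes, by the way, are just every register a push can name: CC, A, B, and DP at one byte each, plus X, Y, U (or S), and PC at two bytes each:

          4 × 1 + 4 × 2 = 12 bytes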
But we can't use all the registers these instructions support. We can't use the program counter, as that's rather important to executing the program (it contains the address of the currently executing instruction; overwriting that will cause the program to start running who knows what). We'll be using both stack registers, so those are out. We could use the CC register, but part of its use is to control the CPU, and setting it to random values could be interesting. Too interesting for me, so that's out (and it's only 8 bits, so it's not a great loss).
That still leaves us with four other registers we can use: X, Y, D (16-bit registers) and DP (an 8-bit register). So realistically, we can transfer up to seven bytes at a time, taking 12 cycles to read and 12 cycles to write, for a theoretical maximum of 3.4 cycles per byte! Woot!
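The arithmetic, for the record (the push and pull instructions cost 5 cycles plus 1 cycle per byte transferred, if I'm reading the timing table right, which is where the 12 comes from):

          X (2) + Y (2) + D (2) + DP (1) = 7 bytes per transfer
          pull: 5 + 7 = 12 cycles, push: 5 + 7 = 12 cycles
          24 cycles ÷ 7 bytes ≈ 3.4 cycles per byte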
The problem with this method is that the stack pointer is used by the CPU to keep track of where it is in the program. Not only that, but when the CPU receives an interrupt (a signal that something has happened and needs to be handled now!) it saves what it is doing on the stack and handles the interrupt. And while on the 6809 the stack register can be used as a general purpose index register (like we've been using the X and U registers), we're hampered by the fact that interrupts happen.
In this case though, it can be done. The stack grows downward in memory—that is, as items are pushed onto the stack, the stack pointer is decremented lower and lower into memory. Taking this into account means we will be filling in the frame backwards, from the bottom of the frame towards the top. So we set the stack pointer to the bottom of the frame. If an interrupt happens, sure, there's some odd stuff written to the frame, but once the interrupt has been handled, the stack pointer is restored to where it was before the interrupt and we can continue copying the background, overwriting the garbage data added by the interrupt.
But pulling data off a stack goes from lower addresses to higher. And the bytes are pulled off in the reverse order they were pushed (as expected—it's a stack after all—last in, first out). So the background data would need to be rearranged to take into account that we're reading the data from low to high, storing it from high to low, that the image effectively ends up reversed in seven-byte chunks, and that the memory in front of a frame can be expected to be trashed by an interrupt. Then there's the issue that 7 does not evenly divide 1,024—we'll have to handle the last two bytes as a special case.
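To make that concrete, here's one way the rearrangement could be done ahead of time (a C sketch of my own, not from the original program; the names are made up, and it assumes that because PULU and PSHS walk their register lists in opposite orders while the pointers also move in opposite directions, each seven-byte group keeps its internal byte order and it's the order of the groups that ends up reversed):

#include <stddef.h>
#include <string.h>

#define FRAME_SIZE 1024   /* bytes per frame */
#define GROUP      7      /* bytes moved per pulu/pshs pair */

/* Rearrange the background image 'src' into 'dst' so the pulu/pshs
   loop below reproduces the original image in the frame: the last
   seven bytes of the image come first, then the seven before those,
   and so on; the first two bytes of the image go at the very end,
   to be picked up by the pulu d / pshs d cleanup after the loop. */
void rearrange_background(unsigned char *dst, const unsigned char *src)
{
    size_t out    = 0;
    size_t groups = (FRAME_SIZE - 2) / GROUP;   /* 146 groups */

    for (size_t g = 0; g < groups; g++)
    {
        memcpy(dst + out, src + FRAME_SIZE - GROUP * (g + 1), GROUP);
        out += GROUP;
    }

    memcpy(dst + out, src, 2);   /* the two leftover bytes */
}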
But assuming the data is stored correctly and we have some memory in front of each frame that can be safely trashed, then the copying code would be:
clrloop   pulu  dp,d,y,x    ; transfer seven bytes
          pshs  x,y,d,dp
          cmpu  #end-2      ; are we there yet?
          bne   clrloop     ; shut up!
          pulu  d           ; some post loop cleanup
          pshs  d
and we get 4.5 cycles per byte.
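Using the same rough per-instruction figures as before, each trip through the loop is 12 + 12 cycles of pulu/pshs plus 8 cycles of cmpu/bne, call it 32 cycles for seven bytes, which lands in the neighborhood of that 4.5 cycles per byte.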
But this level of optimization should only be done if absolutely required. And in my case, it's not required (yet—if it ever will be). I'll be happy to stick with the 6.5 cycle version for now.