Sunday, May 24, 2015
The crazy things that were done to make games run fast in the 8-bit era
Most of the time in my simple graphics program is spent in this loop:
clrloop   ldd   ,u++        ; load 2 bytes from background, increment pointer to next 2 bytes
          std   ,x++        ; store them in current frame, increment pointer to next 2 bytes
          cmpu  #end        ; are we done yet?
          bne   clrloop     ; nope, keep going
Despite its name, it's not clearing memory. It's actually copying the background image to the current frame being drawn. I'm not showing the code before or after this as this post is really about this loop.
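In higher-level terms the loop is nothing more than this (a C sketch of the same operation; the function name, parameter names and FRAME_SIZE constant are mine, with the 1,024-byte frame size taken from later in this post):

#include <string.h>

#define FRAME_SIZE 1024   /* bytes per frame, per the rest of this post */

/* copy the saved background image over the frame being drawn;
   the assembly does this two bytes at a time */
void copy_background(unsigned char *frame, const unsigned char *background)
{
    memcpy(frame, background, FRAME_SIZE);
}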
As written, each iteration of this loop takes 24 clock cycles (or just “cycles”) to run, meaning this code effectively copies one byte every 12 cycles. I recalled reading, several years ago, about a crazy scheme for copying memory on the Motorola 6809 (the CPU used in the Color Computer) that involved using the stack register.
But before I get crazy, just how fast can I get the code to run?
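Where do those 24 cycles go? Taking the data sheet's 5 cycles for cmpu # and 3 for bne, and splitting the remaining 16 evenly between the two indexed instructions (treat the per-instruction counts as approximate):

          ldd   ,u++        8 cycles    ; the copy itself
          std   ,x++        8 cycles    ; ditto
          cmpu  #end        5 cycles    ; pure loop overhead
          bne   clrloop     3 cycles    ; pure loop overhead

A third of every iteration is loop overhead, and over a 1,024-byte frame (512 iterations) the whole copy runs around 12,288 cycles. Spreading that overhead over more bytes is the obvious first move.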
Unrolling the loop a bit:
clrloop   ldd   ,u++        ; load 2 bytes from background
          std   ,x++        ; store 2 bytes to current frame
          ldd   ,u++        ; repeat this seven more times
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          ldd   ,u++
          std   ,x++
          cmpu  #end        ; are we there yet?
          bne   clrloop     ; don't make me turn this CPU around!
and we get 8.5 cycles per byte. Unrolling it more isn't worth it, as the fastest we'll ever get is 8 cycles per byte (assuming we unroll the entire loop to copy all 1,024 bytes; but in doing so we'd use 2K of code, as ldd ,u++ and std ,x++ are both two-byte instructions, just to copy 1K of data).
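For the record, the arithmetic (using the same rough per-instruction figures as before):

          8 × (8 + 8) + 5 + 3 = 136 cycles for 16 bytes = 8.5 cycles per byte

and with the loop fully unrolled the cmpu/bne overhead disappears, leaving just the 8 cycles per byte of the ldd/std pairs.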
Can we do better?
In checking the timings of the various indexing modes, amazingly enough, adding a constant offset instead of just incrementing the pointers is faster. This:
clrloop   ldd   ,u          ; load 2 bytes from background
          std   ,x          ; store 2 bytes to current frame
          ldd   2,u         ; load 2 more from background past the previous data
          std   2,x         ; store them past the previous data
          ldd   4,u         ; and keep this up
          std   4,x
          ldd   6,u
          std   6,x
          ldd   8,u
          std   8,x
          ldd   10,u
          std   10,x
          ldd   12,u
          std   12,x
          ldd   14,u
          std   14,x
          leau  16,u        ; adjust pointers by 16 bytes
          leax  16,x        ; as that's how much we copied
          cmpu  #end        ; are we there yet?
          bne   clrloop     ; enough!
gets us down to 6.5 cycles per byte! But unrolling this any further won't buy us a thing, as once the offset passes 15 the instruction takes longer to execute because of additional instruction decoding (offsets from -16 to 15 fit directly in the index postbyte; anything larger doesn't). So this routine is pretty much it as far as a straightforward approach will take us. Not so bad though: almost twice as fast as the original four-instruction loop. But to go even faster, we have to get crazy and bring in the stack pointer.
Why the stack pointer? Because of four instructions: PSHS, PSHU, PULS, and PULU.
The first instruction can save a number of registers onto the stack. But looking at it another way: it's an instruction that can write up to 12 bytes into memory. The second instruction is similar, but instead of using the stack register, it uses the U register (it's the “user stack pointer”). The data written goes from higher addresses to lower addresses (because traditionally, stacks grow downward in memory). The last two instructions do the reverse, restoring a number of registers from memory, or, looked at another way, reading up to 12 bytes from memory into registers.
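The 12 bytes, by the way, are just every register a push can name: CC, A, B, and DP at one byte each, plus X, Y, U (or S), and PC at two bytes each:

          4 × 1 + 4 × 2 = 12 bytes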
But we can't use all the registers these instructions support. We can't use the program counter, as that's rather important to executing the program (it contains the address of the currently executing instruction; overwriting that will cause the program to start running who knows what). We'll be using both stack registers, so those are out. We could use the CC register, but part of its use is to control the CPU, and setting it to random values could be interesting. Too interesting for me, so that's out (and it's only 8 bits, so it's not a great loss).
That still leaves us with four other registers we can use: X, Y, D (16-bit registers) and DP (an 8-bit register). So realistically, we can transfer up to seven bytes at a time, taking 12 cycles to read and 12 cycles to write, for a theoretical maximum of 3.4 cycles per byte! Woot!
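The arithmetic, for the record (the push and pull instructions cost 5 cycles plus 1 cycle per byte transferred, if I'm reading the timing table right, which is where the 12 comes from):

          X (2) + Y (2) + D (2) + DP (1) = 7 bytes per transfer
          pull: 5 + 7 = 12 cycles, push: 5 + 7 = 12 cycles
          24 cycles ÷ 7 bytes ≈ 3.4 cycles per byte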
The problem with this method is that the stack pointer is used by the CPU to keep track of where it is in the program. Not only that, but when the CPU receives an interrupt (a signal that something has happened and needs to be handled now!) it saves what it is doing on the stack and handles the interrupt. And while on the 6809 the stack register can be used as a general purpose index register (like we've been using the X and U registers), we're hampered by the fact that interrupts happen.
In this case though, it can be done. The stack grows downward in memory—that is, as items are pushed onto the stack, the stack pointer is decremented lower and lower into memory. Taking this into account means we will be filling in the frame backwards, from the bottom of the frame towards the top. So we set the stack pointer to the bottom of the frame. If an interrupt happens, sure, there's some odd stuff written to the frame, but once the interrupt has been handled, the stack pointer is restored to where it was before the interrupt and we can continue copying the background, overwriting the garbage data added by the interrupt.
But pulling data off a stack goes from lower addresses to higher. And the bytes are pulled off in the reverse order they were pushed (as expected—it's a stack after all—last in, first out). So the background data would need to be rearranged to take into account that we're reading the data from low to high, storing it from high to low, that the image effectively ends up reversed in seven-byte chunks, and that the memory in front of a frame can be expected to be trashed by an interrupt. Then there's the issue that 7 does not evenly divide 1,024—we'll have to handle the last two bytes as a special case.
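To make that concrete, here's one way the rearrangement could be done ahead of time (a C sketch of my own, not from the original program; the names are made up, and it assumes that because PULU and PSHS walk their register lists in opposite orders while the pointers also move in opposite directions, each seven-byte group keeps its internal byte order and it's the order of the groups that ends up reversed):

#include <stddef.h>
#include <string.h>

#define FRAME_SIZE 1024   /* bytes per frame */
#define GROUP      7      /* bytes moved per pulu/pshs pair */

/* Rearrange the background image 'src' into 'dst' so the pulu/pshs
   loop below reproduces the original image in the frame: the last
   seven bytes of the image come first, then the seven before those,
   and so on; the first two bytes of the image go at the very end,
   to be picked up by the pulu d / pshs d cleanup after the loop. */
void rearrange_background(unsigned char *dst, const unsigned char *src)
{
    size_t out    = 0;
    size_t groups = (FRAME_SIZE - 2) / GROUP;   /* 146 groups */

    for (size_t g = 0; g < groups; g++)
    {
        memcpy(dst + out, src + FRAME_SIZE - GROUP * (g + 1), GROUP);
        out += GROUP;
    }

    memcpy(dst + out, src, 2);   /* the two leftover bytes */
}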
But assuming the data is stored correctly and we have some memory in front of each frame that can be safely trashed, then the copying code would be:
clrloop   pulu  dp,d,y,x    ; transfer seven bytes
          pshs  x,y,d,dp
          cmpu  #end-2      ; are we there yet?
          bne   clrloop     ; shut up!
          pulu  d           ; some post loop cleanup
          pshs  d
and we get 4.5 cycles per byte.
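Using the same rough per-instruction figures as before, each trip through the loop is 12 + 12 cycles of pulu/pshs plus 8 cycles of cmpu/bne, call it 32 cycles for seven bytes, which lands in the neighborhood of that 4.5 cycles per byte.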
But this level of optimization should only be done if absolutely required. And in my case, it's not required (yet—if it ever will be). I'll be happy to stick with the 6.5 cycle version for now.