Monday, March 13, 2023
Notes on optimizing an O(n)+C algorithm where the C matters quite a bit
[Note: all hexadecimal values are preceeded with a “$”, which is old-school. I know these days it's hip to use “0x” to designate a hexadecimal value, but I hope this is just a passing fad. Also, the “K” here stands for “kilobyte,” which is 1024 bytes. This too, is old fasioned but I don't want to use a word that sounds like pet food. Geeze, kids these days!] [Here's an onion—put it on your belt, you geezer! —The kids] [Hey! I resemble that remark! Get off my lawn! —Sean]
I was doing a bit of retro computing over the weekend, writing 6809 code and running it on a Color Computer emulator (because the Color Computer was my first computer and the 6809 is a criminially underrated 8-bit CPU in my opinion). Part of the coding was enabling all 64K of RAM in the machine. Normally, the Color Computer only sees the lower 32K of RAM, with the upper 32K being ROM (the BASIC interpreter). To enable all 64K of RAM, all that's needed is to stuff any value into memory location $FFDF, which can be done with “POKE &HFFDF,0”. The problem with that is once the ROM goes away, so does BASIC, and the CPU starts executing Lord knows what since the RAM isn't initialized. So the actual procedure is to copy the ROM contents into RAM, which is simple enough:
orcc #$50 ; disable interrupts ldx #$8000 ; start of ROM loop sta $FFDE ; enable ROM mapping lda ,x ; read byte sta $FFDF ; enable RAM sta ,x+ ; write byte back, increment pointer cmpx #$FF00 ; are we done? bne loop andcc #$AF ; enable interrupts
We don't actually want to copy the full 32K of ROM, since the upper 256 bytes of the memory map is for I/O devices. We also disable interrupts since the default interrupt handlers are in ROM and if an interrupt happens when we have RAM mapped, there may not be an interrupt handler to handle it.
The code is straightforward, simple, and unfortunately, slow. Here's the main loop again, this time with the number of cycles and bytes each instruction takes:
; cycles bytes loop1 sta $FFDE ; 5 3 lda ,x ; 4 2 sta $FFDF ; 5 3 sta ,x+ ; 6 2 cmpx #$FF00 ; 4 3 bne loop1 ; 3 2
The loop is 15 bytes, taking 27 cycles for each iteration. Since we copy one byte per iteration and we have 35,512 iterations, it takes 877,824 cycles to run (there are no pipe lines or caches to worry about, so this will always take 877,824 cycles to run). Given the Color Computer runs at .89 MHz (yes, it's not a fast computer) this is nearly a second to copy 32K of RAM.
Can we do better?
Well, yes. Should we is another matter. I mean, this code is typically run once, so does it matter if it takes a second to run? Meh. It'll take longer to load the code from disk, but hey, I'm doing recreational retro programming and want to have a bit of fun.
The first and easiest optimization is to read and write 16 bits at a time instead of 8. And that's easy enough to do—the 6809 does have a 16-bit accumulator so it's an easy change:
; cycles bytes loop2 sta $FFDE ; 5 3 ldd ,x ; 5 2 sta $FFDF ; 5 3 std ,x++ ; 8 2 cmpx #$FF00 ; 4 3 bne loop2 ; 3 2
The loop now takes 30 cycles per iteration, but it's doing 16-bits per iteration instead of 8. The code now takes 487,680 cycles, which is almost half the time, and we've cut the cycles per byte to 15 from 27, and the size of the code hasn't changed. Not bad for a simple optimization.
But doing 4 bytes per iteration should be better, right? It'll take another register (which we have) and some additional bytes, but yes, it is better:
; cycles bytes loop3 sta $FFDE ; 5 3 ldd ,x ; 5 2 ldu 2,x ; 6 2 sta $FFDF ; 5 3 std ,x++ ; 8 2 stu ,x++ ; 8 2 cmpx #$FF00 ; 4 3 bne loop3 ; 3 2
Each iteration now takes 44 cycles,
but we're moving 4 bytes per iteration,
giving us 11 cycles per byte,
and it takes a total of 357,632 cycles.
Again an improvement but we're starting to hit diminishing improvements.
Doing 6 bytes per iteration
(still easy to add) takes us down to 9.833 cycles per byte,
and 8 bytes per iteration takes us down to 9.24 cycles per byte,
but we require the use of the S register
(the system stack pointer,
which needs to be saved and restored)
to do so.
It's not worth trying to use the last register available to us
because you can't load it directly.
The loop also increases in size,
from 19 bytes (shown above) to 23 bytes for the 8-bytes-per-iteration version.
the upper address will have to change for the 6 byte version,
since 6 doesn't cleanly divide 32,512.
It's not a show stopper,
since not all the memory in the upper 32K contains useful code,
so a few bytes not copied won't crash BASIC.
But we can still do better!
Taking inspiration from “A Great Old-Timey Game-Programming Hack” we can copy 8 bytes per iteration (first, the commented code, then the code with the cycles/bytes columns):
orcc #$50 ; disable interrupts sts savesp ; we need the S register lds #$FF00-8; and because we're using the stack, ; we need to start at the top of ROM ; and work our way down loop4 sta $FFDE ; enable ROM mapping puls u,x,y,d ; pull 4 2-byte registers from memory (read memory) sta $FFDF ; enable RAM pshs u,x,y,d ; push 4 2-byte registers to memory (write memory) leas -8,s ; point to next 8-byte block to transfer cmps #$8000-8; are we done? bne loop4 lds savesp ; restore S register andcc #$AF ; enable interrupts
PULS instruction pulls the listed registers off the stack,
PSHS instruction pushes the listed registers onto the stack.
The order of registers in the instructions doesn't matter; they get pushed and pulled such that it all works out.
; cycles bytes loop4 sta $FFDE ; 5 3 puls u,x,y,d ; 13 2 sta $FFDF ; 5 3 pshs u,x,y,d ; 13 2 leas -8,s ; 5 2 cmps #$8000-8; 5 4 bne loop4 ; 3 2
The main loop is now 18 bytes,
so it's on par with the third version.
But each iteration now takes 49 cycles,
but given it moves 8 bytes per iteration,
we get an effective rate of 6.125 cycles per byte,
and a total time of 199,136 cycles.
We could do nine bytes per iteration
PULS instructions support the
to get us down to 5.666 cycles per byte (184,235 cycles total).
We could also add in the
CC register for a total of 10 bytes per iteration
(5.3 cycles per byte; 172,314 cycles total)
but we run the risk of enabling interrupts at the wrong time,
unless we disable interrupts at the hardware level
(which we can do, but it's more code and more invasive of system state).
You can see we'd be getting into diminishing returns again,
so I'm happy with just doing 8 bytes per iteration.
|loop||cycles/iteration||cycles/byte||cycles total||bytes/iteration||code size|
Is this important these days? Not really. Is it fun? Yes, it certainly beats Enterprise Agile any day of the week.