Notes on optimizing an O(n)+C algorithm where the C matters quite a bit

Monday, March 13, 2023

[Note: all hexadecimal values are preceeded with a “$”, which is old-school. I know these days it's hip to use “0x” to designate a hexadecimal value, but I hope this is just a passing fad. Also, the “K” here stands for “kilobyte,” which is 1024 bytes. This too, is old fasioned but I don't want to use a word that sounds like pet food. Geeze, kids these days!] [Here's an onion—put it on your belt, you geezer! —The kids] [Hey! I resemble that remark! Get off my lawn! —Sean]

I was doing a bit of retro computing over the weekend, writing 6809 code and running it on a Color Computer emulator (because the Color Computer was my first computer and the 6809 is a criminially underrated 8-bit CPU in my opinion). Part of the coding was enabling all 64K of RAM in the machine. Normally, the Color Computer only sees the lower 32K of RAM, with the upper 32K being ROM (the BASIC interpreter). To enable all 64K of RAM, all that's needed is to stuff any value into memory location $FFDF, which can be done with “POKE &HFFDF,0”. The problem with that is once the ROM goes away, so does BASIC, and the CPU starts executing Lord knows what since the RAM isn't initialized. So the actual procedure is to copy the ROM contents into RAM, which is simple enough:

	orcc	#$50	; disable interrupts
	ldx	#$8000	; start of ROM
loop	sta	$FFDE	; enable ROM mapping
	lda	,x	; read byte
	sta	$FFDF	; enable RAM
	sta	,x+	; write byte back, increment pointer
	cmpx	#$FF00	; are we done?
	bne	loop
	andcc	#$AF	; enable interrupts

We don't actually want to copy the full 32K of ROM, since the upper 256 bytes of the memory map is for I/O devices. We also disable interrupts since the default interrupt handlers are in ROM and if an interrupt happens when we have RAM mapped, there may not be an interrupt handler to handle it.

The code is straightforward, simple, and unfortunately, slow. Here's the main loop again, this time with the number of cycles and bytes each instruction takes:

			; cycles	bytes
loop1	sta	$FFDE	; 5		3
	lda	,x	; 4		2
	sta	$FFDF	; 5		3
	sta	,x+	; 6		2
	cmpx	#$FF00	; 4		3
	bne	loop1	; 3		2

The loop is 15 bytes, taking 27 cycles for each iteration. Since we copy one byte per iteration and we have 35,512 iterations, it takes 877,824 cycles to run (there are no pipe lines or caches to worry about, so this will always take 877,824 cycles to run). Given the Color Computer runs at .89 MHz (yes, it's not a fast computer) this is nearly a second to copy 32K of RAM.

Can we do better?

Well, yes. Should we is another matter. I mean, this code is typically run once, so does it matter if it takes a second to run? Meh. It'll take longer to load the code from disk, but hey, I'm doing recreational retro programming and want to have a bit of fun.

The first and easiest optimization is to read and write 16 bits at a time instead of 8. And that's easy enough to do—the 6809 does have a 16-bit accumulator so it's an easy change:

			; cycles	bytes
loop2	sta	$FFDE	; 5		3
	ldd	,x	; 5		2
	sta	$FFDF	; 5		3
	std	,x++	; 8		2
	cmpx	#$FF00	; 4		3
	bne	loop2	; 3		2

The loop now takes 30 cycles per iteration, but it's doing 16-bits per iteration instead of 8. The code now takes 487,680 cycles, which is almost half the time, and we've cut the cycles per byte to 15 from 27, and the size of the code hasn't changed. Not bad for a simple optimization.

But doing 4 bytes per iteration should be better, right? It'll take another register (which we have) and some additional bytes, but yes, it is better:

			; cycles	bytes
loop3	sta	$FFDE	; 5		3
	ldd	,x	; 5		2
	ldu	2,x	; 6		2
	sta	$FFDF	; 5		3
	std	,x++	; 8		2
	stu	,x++	; 8		2
	cmpx	#$FF00	; 4		3
	bne	loop3	; 3		2

Each iteration now takes 44 cycles, but we're moving 4 bytes per iteration, giving us 11 cycles per byte, and it takes a total of 357,632 cycles. Again an improvement but we're starting to hit diminishing improvements. Doing 6 bytes per iteration (still easy to add) takes us down to 9.833 cycles per byte, and 8 bytes per iteration takes us down to 9.24 cycles per byte, but we require the use of the S register (the system stack pointer, which needs to be saved and restored) to do so. It's not worth trying to use the last register available to us (DP) because you can't load it directly. The loop also increases in size, from 19 bytes (shown above) to 23 bytes for the 8-bytes-per-iteration version. Also, the upper address will have to change for the 6 byte version, since 6 doesn't cleanly divide 32,512. It's not a show stopper, since not all the memory in the upper 32K contains useful code, so a few bytes not copied won't crash BASIC.

But we can still do better!

Taking inspiration from “A Great Old-Timey Game-Programming Hack” we can copy 8 bytes per iteration (first, the commented code, then the code with the cycles/bytes columns):

	orcc	#$50	; disable interrupts
	sts	savesp	; we need the S register
	lds	#$FF00-8; and because we're using the stack,
			; we need to start at the top of ROM
			; and work our way down
loop4	sta	$FFDE	; enable ROM mapping
	puls	u,x,y,d	; pull 4 2-byte registers from memory (read memory)
	sta	$FFDF	; enable RAM
	pshs	u,x,y,d	; push 4 2-byte registers to memory (write memory)
	leas	-8,s	; point to next 8-byte block to transfer
	cmps	#$8000-8; are we done?
	bne	loop4
	lds	savesp	; restore S register
	andcc	#$AF	; enable interrupts

The PULS instruction pulls the listed registers off the stack, and the PSHS instruction pushes the listed registers onto the stack. The order of registers in the instructions doesn't matter; they get pushed and pulled such that it all works out.

			; cycles	bytes
loop4	sta	$FFDE	; 5		3
	puls	u,x,y,d	; 13		2
	sta	$FFDF	; 5		3
	pshs	u,x,y,d	; 13		2
	leas	-8,s	; 5		2
	cmps	#$8000-8; 5		4
	bne	loop4	; 3		2

The main loop is now 18 bytes, so it's on par with the third version. But each iteration now takes 49 cycles, but given it moves 8 bytes per iteration, we get an effective rate of 6.125 cycles per byte, and a total time of 199,136 cycles. We could do nine bytes per iteration (as the PSHS and PULS instructions support the DP register) to get us down to 5.666 cycles per byte (184,235 cycles total). We could also add in the CC register for a total of 10 bytes per iteration (5.3 cycles per byte; 172,314 cycles total) but we run the risk of enabling interrupts at the wrong time, unless we disable interrupts at the hardware level (which we can do, but it's more code and more invasive of system state). You can see we'd be getting into diminishing returns again, so I'm happy with just doing 8 bytes per iteration.

Summary of results of copying 32,512 bytes from ROM to RAM
loop	cycles/iteration	cycles/byte	cycles total	bytes/iteration	code size
loop1	27	27.000	877,824	1	15
loop2	30	15.000	487,680	2	15
loop3	44	11.000	357,632	4	19
loop4	49	6.125	199,136	8	18
(theoretical)	53	5.300	172,314	10	18

Is this important these days? Not really. Is it fun? Yes, it certainly beats Enterprise Agile any day of the week.

The Boston Diaries

Monday, March 13, 2023

Notes on optimizing an O(n)+C algorithm where the C matters quite a bit

Discussions about this entry

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer