## Friday, March 01, 2024

### The speed of Microsoft's BASIC floating point routines

I was curious about how fast Microsoft's BASIC floating point routines were.
This is easy enough to test,
now that I can time assembly code inside the assembler.
The code calculates -2π^{3}/3! using Color BASIC routines,
IEEE-754 single precision and double precision.

First, Color BASIC:

.tron timing ms_fp ldx #.tau jsr CB.FP0fx ; FP0 = .tau ldx #.tau jsr CB.FMULx ; FP0 = FP0 * .tau ldx #.tau jsr CB.FMULx ; FP0 = FP0 * .tau jsr CB.FP1f0 ; FP1 = FP0 ldx #.fact3 jsr CB.FP0fx ; FP0 = 3! jsr CB.FDIV ; FP0 = FP1 / FP0 neg CB.fp0sgn ; FP0 = -FP0 ldx #.answer jsr CB.xfFP0 ; .answer = FP0 .troff rts .tau fcb $83,$49,$0F,$DA,$A2 .fact3 fcb $83,$40,$00,$00,$00 .answer rmb 5 fcb $86,$A5,$5D,$E7,$30 ; precalculated result

I can't use the `.FLOAT`

directive here since that only supports either the Microsoft format or IEEE-754 but not both.
So for this test,
I have to define the individual bytes per float.
The last line is what the result should be
(by checking a memory dump of the VM after running).
Also,
`.tao`

is 2π,
just in case that wasn't clear.
This ran in 8,742 cycles,
taking 2,124 instructions and 4.12 cycles per instruction
(I modified the assembler to record this additional information).

Next up, IEEE-754 single precision:

.tron timing ieee_single ldu #.tau ldy #.tau ldx #.answer ldd #.fpcb jsr REG fcb FMUL ; .answer = .tau * .tau ldu #.tau ldy #.answer ldx #.answer ldd #.fpcb jsr REG fcb FMUL ; .answer = .answer * .tau ldu #.answer ldy #.fact3 ldx #.answer ldd #.fpcb jsr REG fcb FDIV ; .answer = .answer / 3! ldy #.answer ldx #.answer ldd #.fpcb jsr REG fcb FNEG ; .answer = -.answer .troff rts .fpcb fcb FPCTL.single | FPCTL.rn | FPCTL.proj fcb 0 fcb 0 fcb 0 fdb 0 .tau .float 6.283185307 .fact3 .float 3! .answer .float 0 .float -(6.283185307 ** 3 / 3!)

The floating point control block (`.fpcb`

) configures the MC6839 to use single precision,
normal rounding and projective closure
(not sure what that is,
but it's the default value).
And it does calculate the correct result.
It's amazing that code written 42 *years* ago for an 8-bit CPU works flawlessly.
What it *isn't* is fast.
This code took 14,204 cycles over 2,932 instructions (average 4.84 cycles per instruction).

The higher than average cycle type could be due to position independent addressing modes,
but I'm not entirely sure what it's doing to take nearly twice the time.
The ROM does use the IEEE-754 extended format (10 bytes) internally,
with more bit shifts to extract the exponent and mantissa,
but *twice* the time?

Perhaps it's code to deal with ±∞ and NaNs.

The IEEE-754 double precision is the same,
except for the floating point control block configuring double precision and the use of `.FLOATD`

instead of `.FLOAT`

;
otherwise the code is identical.
The result,
however,
isn't.
It took 31,613 cycles over 6,865 instructions (average 4.60 cycles per instruction).
And being twice the size,
it took nearly twice the time as single precision,
which is expected.

The final bit of code just loads the ROMs into memory, and calls each function to get the timing:

org $2000 incbin "mc6839.rom" REG equ $203D ; register-based entry point org $A000 incbin "bas12.rom" .opt test prot rw,$00,$FF ; Direct Page for BASIC .opt test prot rx,$2000,$2000+8192 ; MC6839 ROM .opt test prot rx,$A000,$A000+8192 ; BASIC ROM .test "BASIC" lbsr ms_fp rts .endtst .test "IEEE-SINGLE" lbsr ieee_single rts .endtst .test "IEEE-DOUBLE" lbsr ieee_double rts .endtst

Really, the only surprising thing here was just how fast Microsoft BASIC was at floating point.