Chip's scancode shenanigans

escher · 2017-11-14 23:02

In Chip's VGA high res text driver, he builds a routine to perform the shifting and waitviding of the line buffer into a region of memory labeled "scancode", then jmps to it when necessary for execution. What I don't understand is how this is faster... you still have to take each 32-bit operation and fetch, decode, execute, etc. So besides saving room by putting it all into memory space instead of explicitly defining the entire unrolled loop, I don't see the benefit:

                        'Build scanbuff display routine into scancode 
                                                                                        
:waitvid                mov     scancode+0,i0           'org     scancode
:shr                    mov     scancode+1,i1           'waitvid color,scanbuff+0
                        add     :waitvid,d1             'shr     scanbuff+0,#8
                        add     :shr,d1                 'waitvid color,scanbuff+1
                        add     i0,#1                   'shr     scanbuff+1,#8
                        add     i1,d0                   '...
                        djnz    scan_ctr,#:waitvid      'waitvid color,scanbuff+cols-1
                            
                        mov     scancode+cols*2-1,i2    'mov     vscl,#hf
                        mov     scancode+cols*2+0,i3    'waitvid hvsync,#0
                        mov     scancode+cols*2+1,i4    'jmp     #scanret

What am I missing here?
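
For anyone following along, here is a rough Python model of what the builder loop above leaves in scancode. It is only a sketch based on the comments in the listing: `build_scancode` is a made-up helper name, the instruction strings are assumed from those comments, and `cols` is whatever the driver was assembled with.

```python
def build_scancode(cols):
    """Model of the run-time builder: return the listing it leaves in scancode."""
    if cols < 1:
        raise ValueError("cols must be at least 1")
    scancode = [None] * (cols * 2 + 2)
    for n in range(cols):  # one pass per djnz iteration
        scancode[2 * n] = f"waitvid color,scanbuff+{n}"   # template i0, source bumped by 1
        scancode[2 * n + 1] = f"shr     scanbuff+{n},#8"  # template i1, dest bumped by d0
    scancode[cols * 2 - 1] = "mov     vscl,#hf"           # i2 overwrites the last shr
    scancode[cols * 2 + 0] = "waitvid hvsync,#0"          # i3
    scancode[cols * 2 + 1] = "jmp     #scanret"           # i4
    return scancode


if __name__ == "__main__":
    for offset, instruction in enumerate(build_scancode(3)):
        print(f"scancode+{offset}: {instruction}")
```

Note that the model ends up with cols*2+2 longs in total, and that the final shr is replaced by the mov to vscl, so the last column is displayed but not shifted by this routine.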

Electrodude · 2017-11-14 23:55

If it were all written out manually, keeping every iteration consistent with the others would be very difficult and error-prone.

On a better assembler with for loop macros, a for loop macro would be used to generate all of this at compile time. But, since Spin/PASM has no macros, it has to be done at run time.

PaulForgey · 2017-11-15 00:05

I've been in a similar situation. Copy-n-paste-n-patch assembly in an unrolled loop is a pain and error-prone, and depending on how many iterations you unroll, it takes a lot of space. I shaved 1K off my program by doing something similar, which is a lot in 32K of program space. Macro assembly would be nice, but in my case I'd still prefer to save the memory.

rogloh · 2017-11-15 00:07

You are missing the fact that this approach saves 256 longs (1kB) of hub RAM space in the size of any executable that uses it, which is not an insignificant amount. It also only ever has to be executed once at the start, so the runtime overhead of code generation is minimal. Even the space taken up by these code creation instructions gets reclaimed, so there is no effective instruction space overhead for them in the COG, though there still is in the hub.

potatohead · 2017-11-15 00:21

Unrolled code ends up in COG RAM, not HUB RAM.

Big savings, small driver HUB footprint.

It's all about the initial HUB image. Keeping that small can matter.

Electrodude · 2017-11-15 00:47

Excellent point. I didn't consider that.

escher · 2017-11-15 03:08

Thanks for the feedback guys.

Correct me if I'm wrong here, but wouldn't performing the scancode routine in a closed loop only add 4 cycles (for a successful DJNZ jump) per iteration? Is the insinuation here that that 5 nanoseconds means the difference between an on-time and de-synced waitvid?

potatohead · 2017-11-15 04:06

Add up the time, subtract from your available scanline time to know. Or try it.

Electrodude · 2017-11-15 04:14

(4 cycles/instruction)/80 MHz = 50 nanoseconds/instruction, not 5 nanoseconds/instruction - you lost a zero.

You can't just add the djnz - you also have to insert an add to increment the pointer in the waitvid, and another add to increment the pointer in the shr that shifts the font data to the next line of the font. That adds up to 12 cycles/iteration.

A resolution of 1024*768*60 Hz has a pixel rate of 60 MHz according to Chip's VGA_HighRes_Text.spin. 80 MHz / ((60M pixels/second) / (16 pixels/waitvid)) ≈ 21 ticks/waitvid.

A waitvid and a shr take 10 cycles, which leaves 11 cycles to spare. The 12 cycles for the djnz and the two adds won't fit.
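
The same budget can be checked in a few lines of Python. This is only a sketch that plugs in the figures quoted in this post (80 MHz system clock, 60 MHz pixel rate, 16 pixels per waitvid, 10 cycles for the waitvid and shr, 12 cycles for the djnz and two adds); the names are made up for illustration and none of the numbers are measured.

```python
SYSCLK_HZ = 80_000_000     # system clock quoted in the post
PIXEL_HZ = 60_000_000      # pixel rate quoted from VGA_HighRes_Text.spin
PIXELS_PER_WAITVID = 16    # pixels per waitvid, as stated in the post

UNROLLED_CYCLES = 10       # waitvid + shr, per the post
LOOP_OVERHEAD_CYCLES = 12  # djnz + two adds, 4 cycles each


def ticks_per_waitvid():
    """System clock ticks available between consecutive waitvids."""
    return SYSCLK_HZ / (PIXEL_HZ / PIXELS_PER_WAITVID)


def fits(cycles_per_iteration):
    """True if one iteration finishes within the per-waitvid budget."""
    return cycles_per_iteration <= ticks_per_waitvid()


if __name__ == "__main__":
    print(f"budget:   {ticks_per_waitvid():.2f} ticks per waitvid")
    print(f"unrolled: {UNROLLED_CYCLES} cycles, fits = {fits(UNROLLED_CYCLES)}")
    looped = UNROLLED_CYCLES + LOOP_OVERHEAD_CYCLES
    print(f"looped:   {looped} cycles, fits = {fits(looped)}")
```

With these inputs the looped version needs 22 cycles against a budget of just over 21, so it misses by about one cycle per iteration.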

escher · 2017-11-15 05:16

Electrodude wrote: »

(4 cycles/instruction)/80 MHz = 50 nanoseconds/instruction, not 5 nanoseconds/instruction - you lost a zero.

You can't just add the djnz - you also have to insert an add to increment the pointer in the waitvid, and another add to increment the pointer in the shr that shifts the font data to the next line of the font. That adds up to 12 cycles/iteration.

A resolution of 1024*768*60 Hz has a pixel rate of 60 MHz according to Chip's VGA_HighRes_Text.spin. 80 MHz / ((60M pixels/second) / (16 pixels/waitvid)) ≈ 21 ticks/waitvid.

A waitvid and a shr take 10 cycles, which leaves 11 cycles to spare. The 12 cycles for the djnz and the two adds won't fit.

Well there you go, the (somehow) dropped factor of 10 was throwing me off; thanks for the analysis!

potatohead · 2017-11-15 22:00

As a general observation, when you find it unrolled in a video driver, it's a pretty safe assumption it needs to be.
