Here is how I see it.
The FIFO was reduced from 16 longs to 8 longs when 16 cogs didn't fit; otherwise I agree with Evan. The block size for RDFAST/WRFAST was kept at 64 bytes so that 16 cogs will work most easily in the future. Fast block moves with SETQ don't use the one-and-only FIFO because they don't need to - these transfers happen at one long per clock, once the correct slice is selected and provided the FIFO is not accessing hub RAM for its burst reads/writes.
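The slice selection mentioned here can be sketched with a toy model. This is a hedged illustration, not silicon-accurate: it assumes an 8-cog part, slice = long address mod 8, and an arbitrary rotation phase. The point it demonstrates is that once a cog lines up with the starting slice, consecutive long addresses land on consecutive slices, so a SETQ block move can stream one long per clock with no further waits.

```python
# Toy model of the P2 "egg-beater" hub (assumptions: 8 cogs/slices,
# slice = long address mod 8, rotation phase chosen for illustration).
COGS = 8

def wait_for_slice(cog, long_addr, clock):
    """Clocks until `cog` lines up with the slice holding `long_addr`."""
    slice_index = long_addr % COGS
    # Assume cog c faces slice (clock + c) % COGS on a given clock.
    return (slice_index - (clock + cog)) % COGS

# Initial alignment may cost up to 7 clocks...
aligned_clock = wait_for_slice(0, 100, 0)
# ...but the next sequential long is free on the very next clock.
follow_up = wait_for_slice(0, 101, aligned_clock + 1)
```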
Yes, where the data is vague is what happens when that slot is taken?
If data travels in order, the fast block move must defer to +8, and from there run until the next deferral, or until it is done.
Q: Can a fast block move transfer data out-of-order? (The data doesn't say so?)
The instruction data bursts can be interrupted by the FIFO bursts, possibly multiple times in a single opcode if the instruction is copying a large number of longwords.
HubRAM burst data order for both the FIFO and instructions is always sequential - incrementing consecutive longword addresses.
PS: Minimum RDLONG execution time is the burst length + 8 clocks, so longer burst lengths have a throughput advantage within the cog. This won't limit the FIFO's access to hub RAM.
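As a quick check of that rule, a throughput model (the burst-length + 8 figure is taken from the post; the rest is plain arithmetic showing how longer bursts amortise the fixed overhead):

```python
# Per the post: minimum RDLONG block execution time is burst length + 8.
def block_move_clocks(n_longs):
    return n_longs + 8

def clocks_per_long(n_longs):
    return block_move_clocks(n_longs) / n_longs

# 8-long bursts cost 2.0 clocks/long; 40-long bursts only 1.2 clocks/long.
```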
If it takes around 1900 clocks to do the 1280 transfers in the non-pixel-doubled 32bpp mode, then for a pixel-doubled variant, which only reads in 320 longs and writes out 640 longs, I'm anticipating it will be just 75% of this, or 1425 clocks. The optimized unrolled copy routine I came up with then uses up 1280 more cycles for pixel doubling, so 1425 + 1280 = 2705, leaving ~500 clocks for any loop overheads, the time to dynamically build the 64 copy instructions in COG RAM, and per-scanline housekeeping. I think we can still do it but it will be tight!
The only thing I see which might be different is that in this mode I might burst-transfer 32 longs at a time instead of 40 (for space reasons), which may interfere with the streamer in a different way. If that doesn't work I could try moving down to 20 at a time instead. I can't do more than 32 at a time as I don't have room for the unrolled instructions. Building the unrolled instruction sequence actually takes a while though.
I think I will need something like this loop, which itself takes 320 clocks! Does anyone know a faster way to do it without reading it dynamically (which I could do, but that's sort of cheating)?
UPDATE: I've added things up for what I'd consider to be a heavily optimized case for doing this processing and get these numbers...
1) hub transfers: ~1425 clocks (anticipated based on measured values but could change)
2) pixel doubling unrolled loop setup: 320 clocks
3) pixel doubling: 1280 clocks
4) outer call overhead + scanline housekeeping: 210 clocks
Total = 3235 clocks.
The total is just over my desired 3200-clock budget but may still be okay. To put this in perspective, a TERC4 packet encode of 2 audio samples plus housekeeping appears to take ~3142 clocks when I counted it up, and I am allowing 3200 for that...
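The UPDATE figures above can be totted up mechanically (all numbers are the post's own; the hub-transfer figure is anticipated rather than measured):

```python
# Sum of the per-scanline cost estimates from the UPDATE list.
parts = {
    "hub transfers (anticipated)": 1425,
    "unrolled loop setup":          320,
    "pixel doubling":              1280,
    "call overhead + housekeeping": 210,
}
total = sum(parts.values())
over_budget = total - 3200
# total -> 3235, i.e. 35 clocks over the 3200-clock budget
```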
I'm measuring FIFOs to be 19 longwords deep. This will have a bearing on the high and low water marks, which in turn has bearing on which slot/slice is the refilling position. So not as simple as just the start address, but still deterministic.
Good point about high and low water marks, do you know what those are on the streamer ?
I was basing my spacing/interactions on the streamer filling, and then taking a long when available.
If it instead has hysteresis, and takes more longs, then it needs to do so much less often, and that makes a big difference to blocking effects.
e.g. a hysteresis of 8 means it reads 8 longs every 80 cycles, giving ~70 free adjacent slots to any fast block move, and now the impact on the fast block move is much reduced.
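A rough model of that hysteresis argument, assuming the rate implied by the 8-longs-per-80-cycles example (one streamer long consumed per 10 clocks on average, refills at one long per clock). Both assumptions are read off the post, and the naive model comes out slightly above the ~70 estimate:

```python
# With hysteresis h the FIFO refills h longs back-to-back, one per
# clock, then leaves the hub alone until the next refill.
def free_run(h, period=10):
    cycle_clocks = h * period   # clocks between refill bursts
    burst_clocks = h            # clocks the refill burst occupies
    return cycle_clocks - burst_clocks

# free_run(1) -> 9 adjacent free slots; free_run(8) -> 72 (the post
# rounds this to ~70) - a much friendlier gap for a fast block move.
```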
When I talk about the streamer doing DMA ops, the FIFO is always in between. From the prop2 doc:
There are three ways the hub FIFO interface can be used, and it can only be used for one of these at a time:
Hub execution (when the PC is $00400..$FFFFF)
Streamer usage (background transfers from hub RAM → pins/DACs, or from pins → hub RAM)
Software usage (fast sequential-reading or sequential-writing instructions)
The FIFO's 8-clock (one rotation) bursts are how the streamer gets its data. Presumably the FIFO's high-water mark is 19 and the low-water mark is 11.
I want to believe, but I don't quite get it yet: how exactly does it move the 64 WRLUT instructions into the 64-long buffer space in COG RAM for later execution? ALTI sometimes gets confusing.
The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.
The lutram address (the S field of the WRLUTs) is incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
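An illustrative Python model (not PASM2) of that increment pattern may help: both ALTIs bump the S (lutram) address, only the second also bumps the D (cogram) address, so each cogram long lands in two consecutive lutram cells - which is exactly the pixel doubling.

```python
# Each call models one ALTI+ALTI+WRLUT+WRLUT group: the redirected
# WRLUT writes are recorded as (cogram D, lutram S) pairs.
def alti_pair(d, s, writes):
    writes.append((d, s)); s += 1           # first ALTI: S++ only
    writes.append((d, s)); s += 1; d += 1   # second ALTI: S++ and D++
    return d, s

writes = []
d = s = 0
for _ in range(4):
    d, s = alti_pair(d, s, writes)
# writes -> [(0,0),(0,1),(1,2),(1,3),(2,4),(2,5),(3,6),(3,7)]
# i.e. cogram long 0 doubled into lutram 0 and 1, long 1 into 2 and 3...
```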
I have some mouse sprite code coming together... it's bloating a little more than I'd like, but I might still squeeze it in.
It's still incomplete as I still need to compute the X offset and clip the sides, accounting for the X hotspot etc. Hopefully not too many more instructions... then time to test.
do_mouse
getword a, mouse_xy, #1 'get mouse y co-ordinate
getnib b, mouseptr, #7 'get y hotspot (0-15)
sub a, b 'compensate for it
subr a, scanline wc 'check if sprite is on scanline
if_nc cmpr a, #15 wc 'check if sprite is on scanline
if_c ret wcz 'mouse sprite not in range, exit
alts bpp, #bittable 'bpp starts out as index from 0-5
mov bitmask, 0-0 'get table entry
mul a, bitmask 'multiply mouse row by its length
shr bitmask, #16 wz 'get mask portion
if_z not bitmask 'fix up the 32 bpp case
mov bpp, bitmask
ones bpp 'convert to real bpp
add a, mouseptr 'add offset to base mouse address
setq2 #17-1 'read 17 longs max, mask is first
rdlong $120, a '..followed by mouse image data
setq2 #16-1 'read 16 scanline longs into LUT
rdlong $110, linebufaddr ' TODO: determine X offset/clipping etc
mov ptra, #$110 'ptra used for source image data
mov ptrb, #$120 'ptrb used for mouse image data
rdlut c, ptrb++ 'the mouse mask is stored first
mul offset, bpp 'calculate mouse offset (0-30)
mov mask, bitmask 'setup mask for pixel's data size
rol mask, offset 'align first pixel
rdlut a, ptra 'get original scanline pixel data
testb bitmask, #0 wc 'set c initially to read sprite in
rep @endmouse, #16 'repeat loop for 16 pixels
if_c rdlut b, ptrb++ 'get next mouse sprite pixel(s)
if_c rol b, offset 'align with the source input data
shr c, #1 wc 'get mask bit 1=set, 0=transparent
if_c setq mask 'apply bit mask to muxq mask
if_nc setq #0 'setup a transparent mouse pixel
muxq a, b 'select original or mouse pixel
rol mask, bpp wc 'advance mask by 1,2,4,8,16,32
if_c wrlut a, ptra++ 'write back data once we filled it
if_c rdlut a, ptra '...and read next source pixel(s)
rol bitmask, bpp wc 'rotate mouse data refill tracker
endmouse
setq2 #16-1
wrlong $110, linebufaddr 'write LUT image data back to hub
ret
bittable
long $0001_00_08 '1bpp
long $0003_00_08 '2bpp
long $000F_00_0C '4bpp
long $00FF_00_14 '8bpp
long $FFFF_00_24 '16bpp
long $0000_00_44 '32bpp
bitmask long 0 'TODO eventually share other register names to eliminate
bpp long 0
offset long 0
mask long 0
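For anyone following along, here is a behavioural sketch in plain Python of what the REP loop computes. It works per pixel rather than on the packed longs the SETQ/MUXQ pair operates on, and the function name and 4bpp example are illustrative only:

```python
# Each mask bit selects the mouse pixel when set, else keeps the
# original scanline pixel - the muxq-style merge, modelled per pixel.
def composite(scanline, mouse, mask_bits, bpp):
    pixel_mask = (1 << bpp) - 1
    out = []
    for i, (bg, sp) in enumerate(zip(scanline, mouse)):
        opaque = (mask_bits >> i) & 1
        out.append(sp & pixel_mask if opaque else bg)
    return out

# 4bpp example: mask 0b0101 takes mouse pixels 0 and 2, keeps
# background pixels 1 and 3.
result = composite([1, 2, 3, 4], [9, 9, 9, 9], 0b0101, 4)  # [9, 2, 9, 4]
```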
The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.
The lutram address (the S field of the WRLUTs) is incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
Yeah, but it doesn't construct new unrolled code anywhere, does it? If this is for moving data to LUT, it takes 4 clocks per long vs the unrolled loop's 2 clocks per long.
Correct. There is still a trade-off: space and construction time for the unrolled 2-clocks-per-longword copy vs the compact 4-clocks-per-longword one. Better than the earlier loop speeds though.
I'd like to create the unrolled loop code using indirection if that could get the loop down to 4 instructions instead of 5. I wonder if that is possible. It would buy another 64 clocks.
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
BTW: I don't know if it's useful but he's freed up the PTRA register too.
Brian's code copies longs from a COGRAM buffer to LUT at 4 clocks per long using indirection, with 4 instructions in the loop. Yes, it is faster than my current duplication code, which takes a little under 6400 clocks, but it is still not fast enough for a 3200-cycle budget. To hit that lower timing budget I found I will need an unrolled loop of 64 COGRAM instructions, which copies longs into LUT RAM at a rate of one long per 2 clocks (i.e. 2x faster than Brian's code).
The problem is that the code to create this unrolled loop dynamically itself takes 5 instructions per pair of WRLUT instructions generated, or 5 x 2 x 32 = 320 clocks in total. If I can shrink that creation loop down to 4 instructions somehow, perhaps with indirection, it will drop to 256 clocks and buy me a bit more of a safety buffer.
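That arithmetic as a one-liner, for playing with the numbers:

```python
# 32 WRLUT pairs are generated; each generator instruction costs 2 clocks.
def creation_clocks(instr_per_pair, pairs=32, clocks_each=2):
    return instr_per_pair * clocks_each * pairs

saving = creation_clocks(5) - creation_clocks(4)
# creation_clocks(5) -> 320, creation_clocks(4) -> 256, saving -> 64
```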
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
That's pretty clever. Can we do something like that for the unrolled loop creation code?
The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total.
Oh! Must be time to drop the construction method then. See my latest rolled-up version above.
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
That's pretty clever. Can we do something like that for the unrolled loop creation code?
I don't see how the dynamically generated unrolled loop can beat evanh's code, and you can do 40-long bursts like the other routines.
Read 320 pixels into COG at 66% efficiency.
Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
Write 640 pixels from LUT to hub at 66% efficiency.
320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.
The unrolled version is still a bit faster.
320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.
Edit: updated earlier compensation factors from 4/3 to 3/2.
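Both estimates, written out as formulas (the 3/2 factor is the ~66% hub efficiency above; instructions executing from cogram cost 2 clocks each):

```python
HUB = 3 / 2  # hub transfer cost factor, clocks per long at ~66% efficiency

def rolled_loop():
    # read 320 longs, 3 instructions per source pixel, write 640 longs
    return 320 * HUB + 3 * 320 * 2 + 640 * HUB

def unrolled_loop(setup=320):
    # read 320 longs, 2 clocks per output long (640), write 640 longs,
    # plus the one-off cost of generating the unrolled code
    return 320 * HUB + 640 * 2 + 640 * HUB + setup

# rolled_loop() -> 3360.0, unrolled_loop() -> 3040.0: 320 cycles apart.
```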
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
Very cool evanh! Plus LUTRAM may work out okay for this special optimized 32bpp doubler, because the 256-entry palette is not required in this 32bpp mode and I can reclaim it. Also it would let me use more of the COGRAM as the initial buffer area, to get 40- or 80-long transfer loops. If it works, I like! You might have just saved me a decent number of additional clocks.
Read 320 pixels into COG at 66% efficiency.
Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
Write 640 pixels from LUT to hub at 66% efficiency.
320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.
The unrolled version is still a bit faster.
320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.
Edit: updated earlier compensation factors from 4/3 to 3/2.
For some reason I had it in my head that you were dynamically generating the code before each burst, rather than each line.
As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?
Then the generation happens once at start of frame, and it collapses to a simple flag check each line.
If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.
Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.
Seeing there was some concern here about FIFO metrics, I just added this to the documentation:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.
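That paragraph, parameterised by cog count (the shipped P2 has 8 cogs, hence the 19-deep measurement earlier in the thread):

```python
# The documented FIFO metrics as functions of cog count.
def fifo_stages(cogs):
    return cogs + 11          # total FIFO depth

def fill_threshold(cogs):
    return cogs + 6           # loads continuously until this many filled

# 8 cogs: 19 stages, loads until 14 are filled, then up to 5 more
# longs can stream in.  A 16-cog variant would be 27 stages deep.
```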
Comments
Maybe using register indirection can help.
Kind of amusing how there are 36 unused machine code bits.
That's a nifty use of the pipeline there Evan! Very cool.
And that's powerful partly because there is a recent history of the output I can look back on and compare with what I got on the previous attempts.