I have looked further at this with Chip's diagrams and I am thinking I finally might have come across a simple sequence for the 2-bit doubling.
32 bit initial input is: XXXXXXXX_XXXXXXXX_abcdefgh_ijklmnop
SPLITW a ' result XXXXXXXX_acegikmo_XXXXXXXX_bdfhjlnp
MOVBYTS a, #%%2020 ' result acegikmo_bdfhjlnp_acegikmo_bdfhjlnp
MERGEB a ' result ababcdcd_efefghgh_ijijklkl_mnmnopop
Simply repeat for the upper 16 bits, just changing the MOVBYTS value accordingly.
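To convince myself the shuffle really lands the bits where the comments claim, here's a small Python model of SPLITW, MOVBYTS and MERGEB (bit orderings as I understand the instruction descriptions) checked against a straightforward reference doubler. The %%3131 value for the upper halfword is my inference from the byte layout after SPLITW, not something tested on silicon:

```python
# Paper check of the 2-bit doubling sequence, not a silicon test.

def splitw(d):
    """SPLITW: odd-numbered bits -> top word, even-numbered bits -> bottom word."""
    top = bottom = 0
    for i in range(16):
        top    |= ((d >> (2 * i + 1)) & 1) << i
        bottom |= ((d >> (2 * i)) & 1) << i
    return (top << 16) | bottom

def movbyts(d, sel):
    """MOVBYTS: result byte k is the source byte selected by 2-bit field k of sel."""
    b = [(d >> (8 * k)) & 0xFF for k in range(4)]
    return sum(b[(sel >> (2 * k)) & 3] << (8 * k) for k in range(4))

def mergeb(d):
    """MERGEB: interleave the bits of the four bytes, MSB first."""
    r = 0
    for j in range(32):
        group, member = divmod(j, 4)
        r = (r << 1) | ((d >> (31 - 8 * member - group)) & 1)
    return r

def double2bpp_low(x):
    """The posted sequence for the low 16 bits (%%2020 == %10_00_10_00)."""
    return mergeb(movbyts(splitw(x), 0b10001000))

def double2bpp_high(x):
    """Presumed upper-half variant: select bytes 3 and 1 (%%3131 == %11_01_11_01)."""
    return mergeb(movbyts(splitw(x), 0b11011101))

def reference_double(halfword):
    """Each 2-bit pixel emitted twice, most significant pair first."""
    out = 0
    for k in range(8):
        pair = (halfword >> (14 - 2 * k)) & 3
        out = (out << 4) | (pair << 2) | pair
    return out

for x in (0x0000_1234, 0xFFFF_89AB, 0x5A5A_C3C3):
    assert double2bpp_low(x) == reference_double(x & 0xFFFF)
    assert double2bpp_high(x) == reference_double(x >> 16)
```

Running this, the posted %%2020 path doubles the low sixteen bits exactly, and picking bytes 3 and 1 with %%3131 does the same for the upper halfword.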
I think this is right, or have I messed up again...nothing surprises me with this shuffling thing, it's been doing my head in. I'll test it out.
Nibble doubling without muxq is tantalising now. Probably best to work backwards using different splits, merges and movbyts.
Yeah I would expect that too, but the exact sequence has so far eluded me. I think it could take more steps, and if it does that may become an issue. The good news is that the 2-bit doubling I found above seems to be working now and, most importantly, I was able to rework this new method into the earlier skipf framework nicely without violating the 7 skip maximum, which would have killed it for me. Even saved a couple of COG longs in the process. Bonus.
Here's the magic pixel doubling sequence that seems to work for all colour depths.
doublepixels
            rep     instcount, readburst    ' repeat instcount instructions, readburst times
            skipf   pattern                 ' per-depth skip pattern selects which instructions run
            rdlut   a, ptra++               ' fetch next source long from LUT buffer
            setq    nibblemask              ' load Q mask for the muxq merges below
            splitw  a
            mov     b, a                    ' keep a copy for the second output long
            movbyts a, bytemask1
            mov     pb, a
            mergeb  a
            mergew  a
            shl     pb, #4
            muxq    a, pb
            wrlut   a, ptrb++               ' first doubled long
            movbyts b, bytemask2
            mergew  b
            mergeb  b
            mov     pb, b
            shl     pb, #4
            muxq    b, pb
            wrlut   b, ptrb++               ' second doubled long
            ret
nibblemask long $0ff00ff0
' ___skip_pattern____bytemask_inst
doublebits long %11110001101100110_01000100_1001
doublenits long %11101001110100011_10001000_1010
doublenibs long %00011000011000100_01010000_1110
doublebytes long %11111001111100110_01010000_0111
doublewords long %11111001111100110_01000100_0111
doublelongs long %11111101111110110_00000000_0101
'bytemask2 is bytemask1^$AA when bit 0 of the skipf pattern is 0
'bytemask2 is bytemask1^$55 when bit 0 of the skipf pattern is 1 (and then cleared to 0)
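For reference, here's how I'm reading those packed table entries, as a quick Python sketch. The field widths (17-bit skipf pattern, 8-bit bytemask1, 4-bit instruction count) are taken from the header comment, and the bytemask2 rule from the two comments above; the field names are mine:

```python
# Hypothetical decode of the packed skip-table entries.

def unpack(entry):
    inst     = entry & 0xF          # inst field (low 4 bits)
    bytemask = (entry >> 4) & 0xFF  # bytemask1
    pattern  = entry >> 12          # 17-bit skipf pattern
    return pattern, bytemask, inst

def bytemask2(bytemask1, pattern):
    # bytemask2 = bytemask1 ^ $AA when pattern bit 0 is 0, else ^ $55
    return bytemask1 ^ (0x55 if pattern & 1 else 0xAA)

doublebits = 0b11110001101100110_01000100_1001
pattern, bm1, inst = unpack(doublebits)
assert (bm1, inst) == (0x44, 0b1001)
assert bytemask2(bm1, pattern) == 0xEE   # pattern bit 0 is 0 here
```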
It's not a 7 skip maximum per se. 8 or more skips in a row simply means the eighth instruction is cancelled (2 cycles consumed) instead of being skipped.
I can see why you'd want to avoid it, but it shouldn't be considered a showstopper.
It's only a showstopper for me on the 16bpp pixel doubling, which hits over 3200 cycles minimum once another instruction is added to the skipf loop, but it will be fine for pure DVI. I've figured out that doubling 16bpp pixels with the current skipf pattern is already a minimum of 2880 clocks, excluding some housekeeping and loop overheads, plus lost clocks from latencies during setq transfers while the streamer is operating, which I've not figured out yet. Maybe @evanh would have some idea of the amount of extra cycles we need to factor in if we are streaming longs from the hub through the FIFO, effectively sustaining long read rates of sys_clock/N, while doing large setq/setq2 based sequential burst transfers from the COG at the same time, where N=10 (for 32/24bpp), N=20 (for 16bpp), N=40 (for 8bpp), etc.
Using immediates instead of registers for the movbyts saves longs overall and fiddling about with skip patterns and masks:
In general yes, but in this case I think I won't be able to squeeze that many instructions in the gaps between different variants and it would go past the 7 skip limit, adding more cycles. It would be fine for depths <= 8 but not 16bpp or the 24bpp I have in there right now (probably only temporarily for truecolour as it is a bandwidth and clock cycle pig).
My changes make no difference to 16bpp, as its movbyts is placed last, so there is still room for two more instructions without any cancelling.
Ok, I can take a look at it if you found a working sequence that keeps things intact for 16bpp and below. I will likely pull that truecolour variant out anyway as a different case. There's always the chance of Saucy's approach for that with the LUT too, though I'd sort of like everything to work the same way, as special cases add more COG overhead to deal with them. With the fastest 7 clock rep loop instead of this common 11 clock version of doubling longs, it will get down to 2240 clocks for the doubling, which is a lot better than 3520! Overall it is still going to be over 3200, but there might be some extra processing time somewhere down the line (which I am sort of banking on).
Adding comments to skipping code makes a huge difference to understanding and is essential really. The maximum number of instructions skipped is quite easy to see (six for 8bpp and five for 16bpp in the code above, with a limit of seven for no wasted cycles).
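A quick check over the skip patterns in the table further up, assuming 1 bits mark skipped instructions, confirms the longest runs all stay within the seven-skip limit:

```python
# Longest run of consecutive skipped (1) bits per pattern,
# using the pattern values from the skip table above.

def max_skip_run(pattern_bits):
    longest = run = 0
    for ch in pattern_bits:
        run = run + 1 if ch == '1' else 0
        longest = max(longest, run)
    return longest

patterns = {
    'doublebits':  '11110001101100110',
    'doublenits':  '11101001110100011',
    'doublenibs':  '00011000011000100',
    'doublebytes': '11111001111100110',
    'doublewords': '11111001111100110',
    'doublelongs': '11111101111110110',
}
runs = {name: max_skip_run(p) for name, p in patterns.items()}
assert all(r <= 7 for r in runs.values())   # no cancelled instructions
```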
Just figured out a good way to resolve the slowness of truecolor pixel doubling to better support a 320 wide pixel mode over DVI, leaving enough cycles for other future things...
I can read the 320 wide pixel data (working on say 32 long chunks at a time) into COG RAM, then just use the following unrolled code where ptra is setup to point into LUT buffer area being written into:
I can easily build this unrolled code block dynamically once at the start and probably store it in the 64 long area I had reserved for a font buffer in text mode, plus I can use the existing 40 long character buffer for holding the 32 long source data read with repeated setq bursts.
The end result is now only 4 P2 clocks per doubled pixel multiplied by 320 = 1280 clocks, plus some outer loop overhead which will be small. This is way better than 7 clocks x 320 I had before with rdlut, wrlut, wrlut combo. Even when I add in the extra 320 clocks to read and 640 clocks to write back to HUB it will be easily less than my 3200 budget. Very happy now!
wrlut with an auto-incrementing pointer is a very powerful instruction for sequential transfers.
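The budget arithmetic above, written out as a check (the numbers are the ones quoted in the post, nothing new):

```python
# Clock budget for the unrolled truecolor doubling, per scan line.

UNROLLED = 4 * 320            # unrolled wrlut pairs: 4 clocks per doubled pixel
OLD      = 7 * 320            # previous rdlut + wrlut + wrlut combo
hub_io   = 320 + 640          # burst-read source + write doubled data back to hub

assert UNROLLED == 1280 and OLD == 2240
assert UNROLLED + hub_io < 3200   # fits the 3200-clock budget with room to spare
```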
I considered a REP too, though the approach in your example will effectively double the execution time, plus I think you missed doing what I want, as it just copies rather than replicates, and increments cogbuf by 1 (but I think I know what you meant). To fit it I think you'd want something like this...
            rep     #4, #32
            wrlut   cogbuf, ptra++
            add     $-1, d0
            wrlut   cogbuf, ptra++
            add     $-1, d0                 ' where d0 = $200 (or you could use the alti method to increment the d field instead)
It takes 8 clocks x 320 pixels = 2560 total.
Building my unrolled loop dynamically is not a problem, it won't take long, and I already have space for it because I needed a buffer for fonts, which will hold this.
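To see why the `add $-1, d0` trick replicates rather than copies, here's a toy Python model of the self-modifying REP loop; the D field of a P2 instruction sits at bits 9..18, which is why adding $200 advances the preceding wrlut's source register by one:

```python
# Toy model of the self-modifying REP loop: each add bumps the D field
# of the wrlut one instruction back, so every source long is written
# twice before the register address advances.

D0 = 0x200                        # 1 << 9, i.e. +1 in the D field

def rep_loop(src, iterations):
    lut = []
    d_field = [0, 0]              # D fields of the two wrlut instructions
    for _ in range(iterations):
        lut.append(src[d_field[0]])   # wrlut cogbuf, ptra++
        d_field[0] += D0 >> 9         # add $-1, d0 (modifies wrlut #1)
        lut.append(src[d_field[1]])   # wrlut cogbuf, ptra++
        d_field[1] += D0 >> 9         # add $-1, d0 (modifies wrlut #2)
    return lut

src = list(range(32))
assert rep_loop(src, 32) == [v for v in src for _ in range(2)]
```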
Ah, common slip-up with that ADD in the REP. The cogbuf address is coded into the instruction. To make it a variable (indirection), an ALTx prefixing instruction is required. In this case ALTI will be the most useful.
Man, I hadn't imagined anyone being that hoggy on the prop2.
LOL. I like to push things to the limit. Pixel doubling 32 bits from hub back into hub is something that does stress the P2 it seems.
While I have your attention evanh, I am wondering if you have any insights into the expected lack of bandwidth from the egg-beater accesses with the streamer/fifo reading out pixels sequentially at different bit depths and competing with a setq burst transfer at the same time.
For 24bpp (32 bit accesses from the streamer every 10th clock) I was thinking it might be something like 1 time in 10 that it is held up, but it may not be as simple as this, and if the FIFO has precedence perhaps you lose 8 or 16 clocks at a time, every time it lines up with the setq slice being requested or something like that. Based on your knowledge of access patterns, do you have anything to suggest? Maybe @cgracey would be best to ask.
I don't actually know, but what makes the most sense to me is that the FIFO reads in whole rotations of 8 clocks at a time, possibly 16 clocks. That has priority, so the hubRAM instructions have to fit in around the FIFO.
On that basis, the streamer has the FIFO bursting for 8 clocks every 80 clocks. You are wanting 40 clocks for an instruction burst. Makes it about 50/50 chance of coinciding and extending your burst by only 8 extra clocks.
If all that is correct, then the next question is what happens if the instruction burst is 80 clocks long. That's 100% coinciding. Is it always just +8 clocks or is there also a chance of +16 clocks?
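That coincidence estimate can be sanity-checked with a toy phase model. This only counts how often the two windows overlap under the assumed 8-in-80 FIFO cadence; it says nothing about the real arbitration, which needs measuring on hardware:

```python
# Toy overlap model: an 8-clock FIFO burst at the start of every
# 80-clock period, versus a 40-clock setq burst at an arbitrary phase.

PERIOD, FIFO_BURST, SETQ_BURST = 80, 8, 40

def overlaps(start, burst=SETQ_BURST):
    # Does a setq burst starting at `start` touch any FIFO-busy clock?
    return any((start + t) % PERIOD < FIFO_BURST for t in range(burst))

hits = sum(overlaps(s) for s in range(PERIOD))
assert hits == 47                                    # 47/80, roughly the "about 50/50" call
assert all(overlaps(s, 80) for s in range(PERIOD))   # an 80-clock burst always coincides
```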
Yes, my mistake: I missed cogbuf incrementing every second wrlut, and for some reason I also was thinking of cogbuf as a pointer to the location, rather than the location itself.
It makes my head swim, too. Best determined by experimentation.
Yeah I will be experimenting with it soon. I can capture the number of clocks taken to do various things and probably display it graphically in the picture on the fly to see how much headroom I have left after doubling, as a proportion of the 6400 cycles during the visible pixel portion for the different modes.
My concern is getting into some pathological case where it hits repeatedly. I am wondering if I might be best to read things in backwards, using ptra--? Is that even possible to do with setq? Perhaps that could help too, or does that increase the time to do the transfer? I could also try to offset the initial setq request in time by some number of clocks if that helps, so I am reading odd long addresses and the streamer might be reading an even long one or something like that. Time will tell what is best.
I think it is 16 clocks, and the FIFO depth is 16 longs, which it has to be for 16 cogs; it wasn't reduced for 8 cogs? Well, the RDFAST/WRFAST block size wasn't reduced from 64 to 32 bytes anyway when 16 cogs wouldn't fit. What I don't know is when the FIFO is reloaded. When it's almost empty? Or half empty? Only one person knows for sure.
For 24bpp, can't hardware do the pixel repeating? A cog doesn't need to manipulate any data within each long and it seems a waste. A related question, does the low byte of RGB24 streamer data have to be zero?
Re fast block moves yielding to RDFAST streaming, none of the addresses is random for video, so if there is a sweetspot timing then it could apply all the time.
SETQ+RD/WRLONG only work in a forward direction, since that is how the RAMs cycle in the egg beater.
FIFO depth was reduced for 8 cogs.
Bit 1 in the 32-bit streamer output selects literal mode, where the three 10-bit fields above bit 1 become serial channel data. If bit 1 is 0, the top three bytes are encoded as colors, each becoming 10 serial bits.
My concern is getting into some pathological case where it hits repeatedly. I am wondering if I might be best to read things in backwards, using ptra--? Is that even possible to do with setq? Perhaps that could help too, or does that increase the time to do the transfer?
No. It only increments. All bursts increment, per sysclock, through consecutive longword locations.
I could also try to offset the initial setq request in time by some number of clocks if that helps, so I am reading odd long addresses and the streamer might be reading an even long one or something like that. Time will tell what is best.
There are two factors, normally, just for the hub access alone, so there is variability even without the FIFO involved. The FIFO's added intrusions are extra on top. Still, managing it is entirely possible, simply because the start burst addresses of both the FIFO and the instruction are deterministic.
Chip, Does the FIFO refill in 8-clock bursts for an 8-cog prop2? Or is it always 16 clocks, to match the 64-byte block size? Is the initial filling different?
Thanks for the clarification, Chip. The 64-byte RDFAST/WRFAST block size was kept at the 16-cog level when the FIFO was reduced. My very late night post wasn't helping much, if at all; apologies.
I can read the 320 wide pixel data (working on say 32 long chunks at a time) into COG RAM, then just use the following unrolled code where ptra is setup to point into LUT buffer area being written into:
            wrlut   cogbuf, ptra++
            wrlut   cogbuf, ptra++
            wrlut   cogbuf+1, ptra++
            wrlut   cogbuf+1, ptra++
            wrlut   cogbuf+2, ptra++
            wrlut   cogbuf+2, ptra++
            wrlut   cogbuf+3, ptra++
            wrlut   cogbuf+3, ptra++        ' ...and so on for all 32 source longs
I can easily build this unrolled code block dynamically once at the start and probably store it in the 64 long area I had reserved for a font buffer in text mode, plus I can use the existing 40 long character buffer for holding the 32 long source data read with repeated setq bursts.
The end result is now only 4 P2 clocks per doubled pixel multiplied by 320 = 1280 clocks, plus some outer loop overhead which will be small. This is way better than 7 clocks x 320 I had before with rdlut, wrlut, wrlut combo. Even when I add in the extra 320 clocks to read and 640 clocks to write back to HUB it will be easily less than my 3200 budget. Very happy now!
wrlut with an auto-incrementing pointer is a very powerful instruction for sequential transfers.
Job for v34.
Is there a reason you can't use a rep block?
e.g.
Smaller code footprint, same execution speed.
Outer loop overhead is as simple as setting cogbuf, ptra, and loopcount.
I thought true colour was 24bpp.
https://en.wikipedia.org/wiki/Color_depth#True_color_(24-bit)