All PASM2 gurus - help optimizing a text driver over DVI?

evanh · 2019-10-15 04:12

Oh, here's a quote from the docs. I've highlighted what I think might be incorrect.

FAST BLOCK MOVES
By preceding RDLONG with either SETQ or SETQ2, multiple hub RAM longs can be read into either cog register RAM or cog lookup RAM. This transfer happens at the rate of one long per clock, assuming the hub FIFO interface is not accessing the same hub RAM slice as RDLONG, on the same cycle.

To me, that is saying there are two buses to each cog that can both be filled on every clock cycle as long as the addresses aren't in the same hub slice. One for the FIFO and one for the data instructions.

Are there really two buses per cog?

rogloh · 2019-10-15 04:15

cgracey wrote: »

Bit 1 in the 32-bit streamer output selects literal mode, where the three 10-bit fields above bit 1 become serial channel data. If bit 1 is 0, the top three bytes are encoded as colors, each becoming 10 serial bits.

Is this true for both LUT lookups AND immediate RGB24 mode as well? I will check this but I seem to recall the DVI crashes if I have random data in the LUTRAM but not if I have random data in the HUB longs when it is sending out RGB24 data, i.e. when in this streamer mode:

1011 dddd eppp 0110   -        RFLONG -> 24-pin + RGB24 (8:8:8) 32 out

If it is LUT RAM modes only where we have to be careful then that might be managed when reading in the palette, though it adds more overhead to go clear the lower 8 bits, instead of a simple setq2+rdlong approach to load it up, especially in 256 colour mode.

Roger.

evanh · 2019-10-15 04:23

rogloh wrote: »

Is this true for both LUT lookups AND immediate RGB24 mode as well? I will check this but I seem to recall the DVI crashes if I have random data in the LUTRAM but not if I have random data in the HUB longs when it is sending out RGB24 data, i.e. when in this streamer mode:

It's the data going into the Cmod hardware. So LUT data for LUT modes, hub data for immediate modes. The 24-bit mode presumably fills the low 8 bits with zeros.

rogloh · 2019-10-15 04:40

evanh wrote: »

The 24-bit mode presumably fills the low 8 bits with zeros.

That's what I think must be (thankfully) happening. I can throw random data into hub and stream from it in this mode and it doesn't lock up. LUT is behaving differently. I guess this LUT raw mode feature is good too as it could allow canned sequences of TERC4 mapped data to be stored there and sent out, though that was not how I did it in my earlier work, but perhaps I could in the future. You just have to be careful with your palette data.

rogloh · 2019-10-15 06:44

If anyone has any good ideas and code snippets for doing a universal mouse sprite routine under all colour depths, that would be great. I have the two text mode cursors done and working, plus added some skew per scanline to allow some limited horizontal scrolling/panning now in text/graphics modes. To make it sort of feature complete as a driver for DVI, the main thing left now is a mouse sprite, plus some sync status update which is trivial. Then it is pretty much done and fully usable...

Ideally I would get some mouse sprite code that fits into less than say 50 longs or so, excluding the space it needs for the sprite data and scanline data which can be reclaimed from other areas. I have a few ideas already to try to make it work over all bit depths and fit in COGRAM and run fast enough. I currently have a pointer to a mouse image data address in hub memory and know its x,y hotspots (0-15) for a 16x16 pixel style of mouse, its X,Y co-ordinates, plus I know the scanline I am on of course.

I want to run it during the later part of the horizontal blanking back porch as I have cycles there and it will allow the most time for HyperRAM data to make it in after being requested during the scanline earlier. The good thing is that I'd expect the FIFO won't need to be accessing HUB RAM at this time so the setq transfers probably get full bandwidth. Here's the general approach I was considering:

1) read in the scanline data at an offset according to the mouse's x position (minus x hotspot). Probably just read in 16 longs from HUB every time no matter what colour depth; simply using 16 each time is faster than figuring out how much you need for that mode, and the burst is fast. Read into COG or LUT.

2) using a pointer to the mouse data and knowing the Y value, compute the scanline data address in hub and read in up to 17 longs (1 long per 32bpp pixel, plus a separate long containing mask values for the 16 pixels). This data could go into LUT or COGRAM, depending what is faster. Each mouse source data "row" could hold 16 pixels and a mask long, perhaps putting the mask first is best. A multiply operation could be used to determine the mouse data source address based on the depth and scanline offset.

3) using the mouse's pixel mask, "slide" along the source scanline data and apply the mouse data to the source data, probably using muxq and a shifting mask. The amount it shifts depends on the bit depth. You'd be advancing in bits/nits/nibbles/bytes/words/longs, and rotating the mask and advancing the source data accordingly.

4) write back the affected 16 longs to hub, to be read later for streaming out over DVI.
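
As a rough illustration of the address arithmetic in step 2, here is a small Python model (not driver code). The layout is an assumption made for the sketch: each sprite row is one mask long followed by 16 pixels padded up to whole longs, with rows packed back to back. The names `row_longs` and `mouse_row_address` and their parameters are hypothetical.

```python
def row_longs(bpp):
    """Longs per sprite row: 1 mask long + 16 pixels padded to whole longs (assumed layout)."""
    pixel_longs = max(1, (16 * bpp) // 32)
    return 1 + pixel_longs


def mouse_row_address(base, scanline, mouse_y, hot_y, bpp):
    """Hub byte address of the sprite row crossing `scanline`, or None if the sprite is not on it."""
    row = scanline - (mouse_y - hot_y)      # 0..15 when the sprite covers this scanline
    if not 0 <= row < 16:
        return None
    return base + row * row_longs(bpp) * 4  # one multiply, then scale longs to bytes


print(hex(mouse_row_address(0x1000, scanline=105, mouse_y=103, hot_y=2, bpp=32)))
```

With this layout a 32bpp row comes out at 17 longs, which matches the "up to 17 longs" figure in step 2; lower depths simply have shorter rows.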

It's the whole step 3 "sliding" and muxing part that is key here and needs to be figured out and work for all bit depths. Ideally it could be made to apply universally. I haven't considered any pixel doubling at this stage, that would only complicate things further, so perhaps if pixel doubling is in effect and the original image was only 320 pixels wide and has been doubled to 640, the supplied mouse image data needs to be designed accordingly and become 8 "fat" pixels wide overlaid onto 16 real pixels. Or it could keep its size and look skinny over the top of the doubled pixels on the screen. Less concerned about that but if that can be solved too that would be good. There are plenty of clocks to do this as the back porch can be over 100 pixels of time, but the COG RAM space is becoming tight unless I switch to hub exec or dynamically read in code which I am trying to avoid for stability reasons.

If anyone has any ideas or tight helpful algorithms I'm happy to take input and make use of it. This DVI code is going to be released to all of us, so now's the chance to contribute to some of it. I've been on this work for many nights now and while it is close to being done, I still want that mouse feature to make it in, that would complete it. I'm feeling a bit fatigued now and lack of sleep must be catching up. It's been fun at times but can be somewhat mentally draining when you are tired.

rogloh · 2019-10-15 08:34

Here's a bit more of an idea of what I was considering for the sliding concept for a mouse sprite. This code is incomplete and doesn't work right for different offset cases but is intended to get people thinking about what might be possible...good thing is that if this type of approach is taken, I think the final code can be kept very small in COGRAM which is what I need. This sample loop takes 354 clocks or ~36 pixels which fits in the back porch. There will be more to it though by the end...

        'ptra points to source scanline image data in LUT, already read from HUB
        'ptrb points to mouse image data in LUT, read from HUB for this scanline
        'mousemask is 16 bits of mouse data pixel mask, also read from HUB
        'mask is a bitfield mask, dependent on colour depth
                ' $00000001 for 1bpp
                ' $00000003 for 2bpp
                ' $0000000F for 4bpp
                ' $000000FF for 8bpp
                ' $0000FFFF for 16bpp
                ' $FFFFFFFF for 24/32bpp
        'bpp is a value indicating bit depth        
                '1 for 1bpp
                '2 for 2bpp
                '4 for 4bpp
                '8 for 8bpp
               '16 for 16bpp
               '32 for 32bpp
        'a, mouseimg are general purpose regs

        rep     @endloop, #16    ' repeat for 16 pixels
        rdlut   a, ptra          ' a is original pixel data from scanline
        rdlut   mouseimg, ptrb   ' get mouse data pixel(s)
        shr     mousemask, #1 wc ' get next pixel in mouse pixel mask, 1=pixel set, 0=transparent
 if_c   setq    mask             ' apply current pixel position to muxq mask
 if_nc  setq    #0
        muxq    a, mouseimg      ' select original pixel or mouse pixel
        rol     mask, bpp wc     ' advance bitmask by 1,2,4,8,16,32
        wrlut   a, ptra          ' write back data
 if_c   add     ptra,#1          ' if c is set, time to move to next source pixel long
 if_c   add     ptrb,#1          ' if c is set, time to move to next mouse image long
endloop
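
For anyone who wants to play with the idea off-chip, here is a minimal Python model of the loop above. It is only a sketch under assumptions: pixel 0 sits in the least significant bits of each long, the sprite starts on a long boundary (so, like the PASM, it does not handle offset cases), and the carry behaviour of `rol ... wc` follows my reading of the instruction description (last bit rotated out, or bit 31 when the count is zero). The function names are made up for the model.

```python
MASK32 = 0xFFFFFFFF


def muxq(d, s, q):
    """MUXQ D,S: for each 1 bit in Q, copy that bit of S into D."""
    return (d & ~q & MASK32) | (s & q)


def rol_wc(value, count):
    """ROL D,S WC: returns (rotated value, carry)."""
    count &= 31
    if count == 0:
        return value, value >> 31          # no rotation: carry taken from bit 31
    rotated = ((value << count) | (value >> (32 - count))) & MASK32
    return rotated, rotated & 1            # last bit rotated out lands in bit 0


def overlay_row(scanline, mouse, mousemask, bpp):
    """Apply 16 sprite pixels to a copy of the scanline longs."""
    out = list(scanline)
    mask = (1 << bpp) - 1                  # field mask for the current pixel
    index = 0                              # stands in for ptra and ptrb, which advance together
    for _ in range(16):
        q = mask if mousemask & 1 else 0   # 1 = sprite pixel, 0 = transparent
        mousemask >>= 1
        out[index] = muxq(out[index], mouse[index], q)
        mask, carry = rol_wc(mask, bpp)    # slide to the next pixel field
        index += carry                     # wrapped: move on to the next long
    return out


print([hex(x) for x in overlay_row([0x11111111] * 4, [0xAABBCCDD] * 4, 0b0101, 8)])
```

At 32bpp the rotate count is effectively zero, and the wrap on every pixel comes from bit 31 of the all-ones mask.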

ozpropdev · 2019-10-15 12:17

Roger
Here's a simple way to get your masks from the bpp value

	mov	mask,bpp
	sub	mask,#1
	bmask	mask
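
A quick way to sanity-check that sequence off-chip is a tiny Python model. It assumes BMASK produces an LSB-justified mask of (value[4:0] + 1) bits, which is how I read the instruction description; `bmask` and `mask_from_bpp` are names invented for the sketch.

```python
def bmask(d):
    """Model of BMASK D: LSB-justified mask of (D[4:0] + 1) bits (assumed semantics)."""
    return (1 << ((d & 31) + 1)) - 1


def mask_from_bpp(bpp):
    """mov mask,bpp / sub mask,#1 / bmask mask"""
    return bmask(bpp - 1)


for bpp in (1, 2, 4, 8, 16, 32):
    print(bpp, hex(mask_from_bpp(bpp)))
```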

cgracey · 2019-10-15 13:26

TonyB_ wrote: »

cgracey wrote: »

FIFO depth was reduced for 8 cogs.

Thanks for the clarification, Chip. The 64-byte RDFAST/WRFAST block size was kept at the 16-cog level when FIFO was reduced. My very late night post not helping much if at all, apologies.

That's right. The block granularity is 64 bytes, or 16 longs, as this assures software compatibility, but not timing consistency, between future chips.

TonyB_ · 2019-10-15 13:26

ozpropdev wrote: »
Roger
Here's a simple way to get your masks from the bpp value
	mov	mask,bpp
	sub	mask,#1
	bmask	mask

Here's a simpler way to get bpp from mask

	mov	bpp,mask
	ones	bpp
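
The reverse direction can be checked the same way. This Python sketch assumes ONES simply counts the set bits of a 32-bit value; the function name `ones` here is just a stand-in for the instruction.

```python
def ones(value):
    """Model of ONES: number of 1 bits in a 32-bit value (assumed semantics)."""
    return bin(value & 0xFFFFFFFF).count("1")


for mask in (0x1, 0x3, 0xF, 0xFF, 0xFFFF, 0xFFFFFFFF):
    print(hex(mask), ones(mask))
```

It only recovers bpp while the mask is still a contiguous run of bits, which holds for the rotated masks in Roger's loop as well.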

How often is bmask used?

cgracey · 2019-10-15 13:27

evanh wrote: »

Oh, here's a quote from the docs. I've highlighted what I think might be incorrect.

FAST BLOCK MOVES
By preceding RDLONG with either SETQ or SETQ2, multiple hub RAM longs can be read into either cog register RAM or cog lookup RAM. This transfer happens at the rate of one long per clock, assuming the hub FIFO interface is not accessing the same hub RAM slice as RDLONG, on the same cycle.

To me, that is saying there are two buses to each cog that can both be filled on every clock cycle as long as the addresses aren't in the same hub slice. One for the FIFO and one for the data instructions.

Are there really two buses per cog?

There's just one bus. The FIFO gets priority over block r/w's.

cgracey · 2019-10-15 13:32

rogloh wrote: »
cgracey wrote: »

Bit 1 in the 32-bit streamer output selects literal mode, where the three 10-bit fields above bit 1 become serial channel data. If bit 1 is 0, the top three bytes are encoded as colors, each becoming 10 serial bits.

Is this true for both LUT lookups AND immediate RGB24 mode as well? I will check this but I seem to recall the DVI crashes if I have random data in the LUTRAM but not if I have random data in the HUB longs when it is sending out RGB24 data, i.e. when in this streamer mode:
1011 dddd eppp 0110   -        RFLONG -> 24-pin + RGB24 (8:8:8) 32 out  
If it is LUT RAM modes only where we have to be careful then that might be managed when reading in the palette, though it adds more overhead to go clear the lower 8 bits, instead of a simple setq2+rdlong approach to load it up, especially in 256 colour mode.

Roger.

Always, bit 1 controls literal vs color encoding. In the RGB24 mode, bits 7:0 are always cleared in the data, so bit 1 will be 0.
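
For the LUT palette case, the practical consequence can be sketched in Python as a pre-processing step done wherever the palette is prepared. This is an illustration only: it assumes that clearing bits 7:0 of each entry (as the RGB24 mode is described as doing) is an acceptable way to keep bit 1 at zero, and the names `is_literal` and `sanitize_palette` are invented for it.

```python
LITERAL_BIT = 1 << 1   # bit 1 of the 32-bit streamer output selects literal mode


def is_literal(streamer_long):
    """True if this long would be treated as raw 10-bit channel data."""
    return bool(streamer_long & LITERAL_BIT)


def sanitize_palette(entries):
    """Clear bits 7:0 of every palette long so none can select literal mode."""
    return [e & 0xFFFFFF00 for e in entries]


palette = [0xFF8040A7, 0x12345602, 0x00FF0000]
clean = sanitize_palette(palette)
print([hex(e) for e in clean], any(is_literal(e) for e in clean))
```

Doing this once on the host or at palette-write time would keep the plain setq2+rdlong load intact.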

evanh · 2019-10-15 14:01

cgracey wrote: »

There's just one bus. The FIFO gets priority over block r/w's.

Reading the update, I now understand why you've retained the wording of it depending on the hubram slice. The extra description about the block move having to wait for next rotation helps a lot.

The "egg-beater" certainly isn't an easy thing to describe.

TonyB_ · 2019-10-15 14:14

cgracey wrote: »

In the RGB24 mode, bits 7:0 are always cleared in the data, so bit 1 will be 0.

Does "always cleared in the data" mean "must be cleared by the user"? Could bits 7:0 be don't care if HDMI mode is not enabled?

evanh · 2019-10-15 14:23

It means that streamer hardware, for that mode, always clears those bits in the data being streamed.

cgracey · 2019-10-15 14:55

TonyB_ wrote: »

cgracey wrote: »

In the RGB24 mode, bits 7:0 are always cleared in the data, so bit 1 will be 0.

Does "always cleared in the data" mean "must be cleared by the user"? Could bits 7:0 be don't care if HDMI mode is not enabled?

Maybe "always clear in the data" would be better. Those bits are masked off.

jmg · 2019-10-15 20:54

rogloh wrote: »

... the expected lack of bandwidth from the egg-beater accesses with the streamer/fifo reading out pixels sequentially at different bit depths and competing with a setq burst transfer at the same time.
For 24bpp (32 bit accesses from the streamer every 10th clock) I was thinking it might be something like 1 time in 10 that it is held up, but it may not be as simple as this, and if the FIFO has precedence perhaps you lose 8 or 16 clocks at a time, every time it lines up with the setq slice being requested or something like that.

This quickly gets complex, as there are slot and fifo interactions.

This is how I think they interact :

Taking the /10 example, the streamer output will average 10 clocks, but the streamer read-in is actually slightly different and is FIFO smoothed.
Because the streamer reads +1 address each time, it must take a slot mostly every 9 sysclocks, and will skip to 17 sysclks as needed to average exactly 10 read-out rate.
eg 9*9+17 = 98, close to the 100 empty average, but slightly too fast, so another +17 will be inserted sometimes (~ 2%).

With the streamer taking a slot every 9, the phase of the block access becomes important. If it collides, the block-read-INC stalls and it is deferred +8 for the next slot, but +9 is mostly taken, so it can only read one slot per 9 because of the 'lane blocking ' effect of streamer.
However, when the streamer hits the +17 FIFO rate skip, the block can do a full run (I think 8 deep fifo from above?),
While the streamer is doing (roughly)
9*9+17 = 98
the block fifo is doing this
9*1+8 = 17 longs in 98 sysclks. (note a 16 deep fifo could have been useful here, allowing 9*1+16=25 longs in 98 sysclks)
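
The arithmetic of this estimate can be written out in a few lines of Python. To be clear, this only reproduces the numbers in the estimate above; it is not a model of the actual hub hardware, and the function and parameter names are made up.

```python
def block_move_estimate(fifo_depth=8):
    """Return (streamer clocks per cycle, block longs moved, fraction of full rate)."""
    streamer_clocks = 9 * 9 + 17        # nine 9-clock slots plus one 17-clock skip
    block_longs = 9 * 1 + fifo_depth    # one long per blocked slot, then one full run
    return streamer_clocks, block_longs, block_longs / streamer_clocks


clocks, longs, fraction = block_move_estimate()
print(clocks, longs, round(fraction * 100, 1))
```

Passing `fifo_depth=16` gives the 25-longs-in-98-clocks case mentioned in the note.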

rogloh · 2019-10-15 21:04

Wait, jmg are you saying I will only effectively get around ~17% transfer throughput in 98 clocks from my setq burst in 32bpp mode? That could be a real deal breaker for performance.

I am kind of hoping I am misunderstanding you.

jmg · 2019-10-15 21:17

rogloh wrote: »

Wait, jmg are you saying I will only effectively get around ~17% transfer throughput in 98 clocks from my setq burst in 32bpp mode? That could be a real deal breaker for performance.

I am kind of hoping I am misunderstanding you.

You would need to test this, but yes, that is my understanding.
If you are using both streamer and block moves on the same BUS, the fact they both need sequential memory access does tend to lock them together, once they bump into each other.

rogloh · 2019-10-15 21:20

I could hopefully add a waitx between each burst transfer of a small/odd number of clocks to assist perhaps, or start on odd addresses?

rogloh · 2019-10-15 21:26

When I say "deal breaker", I don't mean this DVI version by the way. That is already working in all modes with the existing approach meaning it's most probably operating somewhere within a 6400 clock cycle budget and can get things done in time before the next hsync is due. Not sure exactly where it completes yet but I'll be measuring it soon to get a handle on the actual margins.

jmg · 2019-10-15 22:11

rogloh wrote: »

I could hopefully add a waitx between each burst transfer of a small/odd number of clocks to assist perhaps, or start on odd addresses?

Moving address could help but it's not easy to know where the streamer reads are up to, in order to avoid bumping into their slot ?
A wrapped/deferred block read could have higher bandwidths, but I don't think P2 has that in HW ? I think it is true FIFO only ?
That would defer a collided read, but continue anyway, using the spare slots to fill a buffer, until any skipped slot appears free, continuing until the last buffer update is done.
eg you could transfer [skip.2.3.4.5.6.7.8] then come back to get 1, and then flag the block as ready.

evanh · 2019-10-15 23:37

The streamer fetches data from the FIFO as it needs.
The FIFO burst reads eight longwords at a time from hubRAM.
The size of the FIFO is around 16 longwords and will be filled with a double burst upon the initialising RDFAST instruction.
Slot position (hubRAM slice) of the FIFO bursts is aligned by the specified start address. Each burst is a complete rotation.

The FIFO can reliably and efficiently handle any streamer data rate all the way to sysclock. Any remaining slot cycles are also efficiently available for that cog's data accesses.
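
To make the "slot position is aligned by the start address" point concrete, here is a small Python sketch. It assumes the 8-slice arrangement of the 8-cog chip with consecutive longs living in consecutive slices, and that a cog's view advances by one slice per clock; the helper names are hypothetical.

```python
SLICES = 8   # assumed: one hub RAM slice per cog on the 8-cog chip


def hub_slice(byte_addr):
    """Slice holding the long at this hub byte address (longs interleaved across slices)."""
    return (byte_addr >> 2) % SLICES


def clocks_to_reach(current_slice, byte_addr):
    """Clocks a cog waits until its rotating window lands on the wanted slice."""
    return (hub_slice(byte_addr) - current_slice) % SLICES


print(hub_slice(0x1C), clocks_to_reach(6, 0x04))
```

Under these assumptions a burst of eight sequential longs touches every slice exactly once, which is why each burst is a complete rotation.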

jmg · 2019-10-15 23:42

evanh wrote: »

The streamer can fetch from the FIFO as it needs.
The FIFO burst reads eight longwords at a time from hubRAM.
The size of the FIFO is around 16 longwords and will be filled with a double burst upon the initialising RDFAST instruction.

The FIFO can reliably and efficiently handle any data rate all the way to sysclock. Any remaining slot cycles are also efficiently available for that cog's data accesses.

- but there are two fifos in play here, if you try to operate streamer and burst data at the same time.
Streamer access trumps data burst access.
I think Chip said both streamer and data fifos are 8 longs each ?

evanh · 2019-10-15 23:57

Hehe, okay, just realised "cog data instructions" still isn't a narrow enough label. We are talking just about RDBYTE/RDWORD/RDLONG/WRBYTE/WRWORD/WRLONG along with the SETQ/SETQ2 bursting variants. ie: Not including the FIFO based instructions, eg: RFLONG.

There is only one FIFO and one streamer per cog. If the streamer is using the FIFO then the cog goes without. So no hubexec, and no RFLONG.

Cog bursting data instructions don't use the FIFO to do the burst operation.

TonyB_ · 2019-10-16 00:01

rogloh wrote: »

When I say "deal breaker", I don't mean this DVI version by the way. That is already working in all modes with the existing approach meaning it's most probably operating somewhere within a 6400 clock cycle budget and can get things done in time before the next hsync is due. Not sure exactly where it completes yet but I'll be measuring it soon to get a handle on the actual margins.

If it works, it works.

jmg wrote: »

evanh wrote: »

The streamer can fetch from the FIFO as it needs.
The FIFO burst reads eight longwords at a time from hubRAM.
The size of the FIFO is around 16 longwords and will be filled with a double burst upon the initialising RDFAST instruction.

The FIFO can reliably and efficiently handle any data rate all the way to sysclock. Any remaining slot cycles are also efficiently available for that cog's data accesses.

- but there are two fifos in play here, if you try to operate streamer and burst data at the same time.
Streamer access trumps data burst access.
I think Chip said both streamer and data fifos are 8 longs each ?

Here is how I see it.

The FIFO was reduced from 16 longs to 8 longs when 16 cogs didn't fit, otherwise I agree with Evan. Block size for RDFAST/WRFAST was kept at 64 bytes so that 16 cogs will work most easily in the future. Fast block moves with setq don't use the one-and-only FIFO because they don't need to - these transfers move one long per clock cycle, once the correct slice is selected and provided the FIFO is not accessing hub RAM for its burst reads/writes.

evanh · 2019-10-16 00:11

The FIFOs refill in 8 longword bursts but their size is bigger so they can initiate the burst at half empty. It'll take up to 8 clocks to get the first bursting word from hubRAM, maybe more.

EDIT: And in that interval where the FIFO is waiting for slice alignment, the cog can get access to hubRAM.

TonyB_ · 2019-10-16 00:17

evanh wrote: »

The FIFOs refill in 8 longword bursts but their size is bigger so they can initiate the burst at half empty. It'll take up to 8 clocks to get the first bursting word from hubRAM, maybe more.

EDIT: And in that interval where the FIFO is waiting for slice alignment, the cog can get access to hubRAM.

Chip re-affirmed yesterday that FIFO depth is 8 longs.

An interesting test is to set the streamer going slowly enough to write and modify some of the eight hub RAM addresses that presumably were read to fill the FIFO for the first time before the streamer has had time to output them all. Are any modified longs output?

evanh · 2019-10-16 00:22

TonyB_ wrote: »

Chip re-affirmed yesterday that FIFO depth is 8 longs.

He said they were reduced from the earlier 16-cog design. And that the burst length is now 8 longwords.

rogloh · 2019-10-16 00:40

I think I'm thankfully getting better than 17% setq throughput in 32bpp modes. It's at least over 50%.

EDIT: Seems to be ~66% when I factor out overheads.

I added a visual indicator on my screens (text and graphics modes) that tracks the point at which 3200 clocks have elapsed as a proportion of the displayed scanline. I have also added another indicator in a different colour based on the elapsed time since the streamer started sending active pixels and I begin the next scanline to when I complete the entire scanline processing.

In all cases except line doubled 16bpp and 32bpp I see we are well within budget. The line doubled mode is still unoptimized for 32bpp with my latest ideas and I was expecting that to be a lot over, which it was.

The 16bpp doubled variant is only just over budget (I just held up a ruler to the screen and checked it). It is 213mm/204mm*3200clks=3341 or ~4.5% or so over. 16bpp undoubled takes (65/204)*3200 = ~1020 clocks.

The 32bpp doubled scanline takes (350/204)*3200 =5490 clocks to complete and (140/204)*3200 = 2196 clocks when not doubled.
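
The ruler conversion is simple proportion, shown here as a short Python sketch using the figures quoted above (204mm on screen corresponding to the 3200-clock marker). The constant and function names are made up for the illustration.

```python
BUDGET_CLOCKS = 3200   # clocks at the on-screen marker
BUDGET_MM = 204        # ruler distance to that marker


def mm_to_clocks(mm):
    """Convert a ruler measurement on screen to elapsed clocks."""
    return mm / BUDGET_MM * BUDGET_CLOCKS


def percent_over(mm):
    """How far past the 3200-clock marker a measurement is, in percent."""
    return (mm / BUDGET_MM - 1) * 100


for mm in (213, 350, 140):
    print(mm, round(mm_to_clocks(mm)))
```

These are ruler-accuracy numbers only, so a millimetre of reading error is roughly 16 clocks.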

Most of the undoubled pixel COG work is simply transferring in from one hub location and writing back to another hub location in bursts of 40 longs (20 for 1bpp mode). ie. it is primarily setq burst transfer work, with some outer loop overheads. So for 640 pixel mode in 32bpp we are doing 640 longs in and 640 longs out. That's a total of 1280 longs transferred in the overall 2196 clocks measured. Here's the loop I use below. When pixels are not doubled, carry is clear so the code below reduces to just a burst read and burst write in the loop with a few overhead cycles. There are a few other things done as well after this maybe ~14 or so other instructions, but the vast majority of the scanline is just copying data.

transferloop
            setq2   #0-0                    'block copy from HUB source to LUT
            rdlong  $140, ptra++
    if_c    mov     save_a, ptra            'preserve current pointers
    if_c    mov     save_b, ptrb            
dbl if_c    callpb  #0-0, movepixels        'double pixels
    if_c    mov     ptrb, save_b            'restore pointers
    if_c    mov     ptra, save_a
            setq2   burst                   'setup output burst for hub writes
writeback   wrlong  $0-0, ptrb++            '...and write back to line buffer
            djnz    c, #transferloop        'repeat
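
As a back-of-envelope check on the throughput, here is the calculation in Python, using only the figures from this post (640 pixels at 32bpp, bursts of 40 longs, about 2196 clocks measured). The function name is invented, and the result includes all loop and per-scanline overhead.

```python
def burst_throughput(pixels=640, burst_longs=40, measured_clocks=2196):
    """Return (loop iterations, total longs moved, longs per clock) for one 32bpp scanline."""
    iterations = pixels // burst_longs      # one burst read + one burst write per pass
    total_longs = pixels * 2                # every long is read in and written back
    return iterations, total_longs, total_longs / measured_clocks


iterations, total, rate = burst_throughput()
print(iterations, total, round(rate * 100))
```

That works out a little under 60% of one long per clock before overheads are factored out, comfortably above the 17% worst-case estimate.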

I can try to collect proper clock values later via some serial interface...or send it out a pin to my logic analyzer to observe it better. But I'm relieved things were not way worse than I thought at this point...still it might be hard to pixel double 32bpp in the lower budget but hopefully doable.

evanh · 2019-10-16 00:44

TonyB_ wrote: »

An interesting test is to set the streamer going slowly enough to write and modify some of the eight hub RAM addresses that presumably were read to fill the FIFO for the first time before the streamer has had time to output them all. Are any modified longs output?

RDFAST buffers those in the FIFO before the streamer gets engaged. So doesn't have to be at a slow rate to test it.
