@pik33 said:
It can be a good idea to make the window decorations out of sprites.
Tiling (I should have said tiles rather than partitions earlier) doesn't need border decorations. It won't be like a desktop GUI, but is that really desired?
The current version of Prop2Play uses this kind of pseudowindow: not movable, simply decorated. There are 3 different pixel data sources: one for the scrolling info at the bottom, a second for the file info window, and the main buffer for the rest. As it is now, the file info window is blitted to the main buffer every frame; this takes about 5 ms, so I can still fit everything I need into the 20 ms frame.
The updated driver allows switching to the scrolling window without blitting.
And yes, I want a desktop-like GUI.
I moved the repository to GitLab:
https://gitlab.com/pik33/P2-retromachine
A bug cost me several days to catch.
The symptom: the first line of the picture was bad. The first 56 to 80 pixels were a copy of the 4th line counting from the bottom, including any sprite that was there.
The problem was the streamer's FIFO.
The driver has a 4-line buffer in the hub, into which it preloads data from the PSRAM; the streamer then reads it in RDFAST+LUT mode. When the streamer displayed the last line of the frame, the FIFO prefetched several bytes past the end, wrapping to the start of the buffer, which at that moment still held a stale line (the 4th line from the bottom): nothing new had been preloaded there, because there were no more lines to preload.
Then, at the start of the new frame, the buffer was preloaded with the first line, but the streamer ignored it, because it already had these longs in its FIFO.
The solution: re-init the RDFAST every frame after preloading the first line, rather than at the start of the frame, so the FIFO prefetches the proper pixels.
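In sketch form (hypothetical labels, not the actual driver code), the fixed per-frame order is:
frame_start
        call    #preload_line0          ' first fill the hub buffer with line 0 from the PSRAM
        rdfast  #0, ##linebuf           ' only then (re)start the FIFO, so it prefetches valid pixels
        ' ... start streaming the new frame ...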
Yes, obvious in hindsight: the FIFO is a prefetcher when in RDFAST mode. If the RAM has changed, usually from another cog/streamer, then the FIFO content becomes stale.
WRFAST FIFO operation is much more aggressive with its writes to hubRAM: it writes any fresh data ASAP. This is good in one sense but excessively consumes hubRAM bandwidth, and since the FIFO has priority, this impacts its cog's own accesses to hubRAM. It took a lot of testing to work this out - on ManAtWork's prompting, I think.
A first window test. The window is a BASIC class with its own graphics procedures, so it has its own independent PSRAM framebuffer, size, etc., including the cog which does the drawing. The video driver can place the window at any position on the screen using a display list and a PSRAM access list. To have more windows, I now have to write a manager which can generate these lists on the fly from the window positions.
While the class mostly works, there is a bug somewhere in the PSRAM access: the window interferes with the background.
It seems all the basic graphics stuff in the class works now, so it's time to start the manager.
This is pretty good stuff. I wonder if there will be sufficient time on the line for many overlapping windows, or just a very small number of them. Each time the request list moves to a new item, it loses some read bandwidth, because it needs to start a new transfer, which incurs overhead.
In this case it's important to process the request list in one run until it ends, without checking other cogs and mailboxes, to save as much time as possible. I have the driver open now, but it is a complex piece of code, so I am still searching and trying to understand what happens there.
Ideally, if you mark a COG's QoS attribute as LOCKED, it could complete the request list before yielding, but I'm not sure that is the current behaviour. I created request lists mainly for client writer COGs, which are typically low priority, not really for video reader COGs, so I don't think I made it work that way. It could potentially be modified to do so, however.
The test results are strange. This is the test code:
dim psram as class using "psram4"
dim s(512) ' stack for the worker cog
psram.startx(0, 0, 12, -1) ' start the PSRAM driver
let mbox=psram.getMailbox(0) ' use mailbox #0

' Build a 3-element request list. Each element is 4 longs:
' (0) request: $B0... (read burst) plus the PSRAM address
' (1) hub address, (2) byte count, (3) address of the next element, 0 = end of list
dim list1(11)
list1(0)=$B0000000+$600000 ' element 1: read from PSRAM $600000
list1(1)=$60000 ' into hub $60000
list1(2)=127 ' 127 bytes
list1(3)=addr(list1(4))
list1(4)=$B0_600000+127 ' element 2: continue where element 1 ended
list1(5)=$60000+127
list1(6)=511 ' 511 bytes
list1(7)=addr(list1(8))
list1(8)=$B0600000+511+127 ' element 3: the rest
list1(9)=$60000+511+127
list1(10)=1024-511-127 ' 386 bytes, 1024 in total
list1(11)=0 ' end of the list

'cog1 '<- use this call instead to run in cog #0
let cog=cpu(cog1,@s) '<- runs the test in its own cog

sub cog1
for thecog=0 to 7:psram.setQoS(thecog, 112 << 16):next thecog ' default QoS for all cogs
psram.setQoS(cpuid(), $7FFFF400) ' raised QoS for this cog
waitms(1000)
' test 1: load 1 kB via the 3-element request list
lpoke mbox+4,addr(list1) ' list address
lpoke mbox,$FFFFFFFF ' request: process the list
let t1=getct()
do : loop until (lpeek(mbox) and $80000000) = 0 ' wait until the driver clears the busy bit
let t1=getct()-t1: print t1
' test 2: load the same 1 kB with a single direct request
lpoke mbox+8,1024 ' byte count
lpoke mbox+4,$60000 ' hub address
lpoke mbox,$B060_0000 ' request: read burst from PSRAM $600000
let t1=getct()
do : loop until (lpeek(mbox) and $80000000) = 0
let t1=getct()-t1: print t1
end sub

' write a long directly to hub RAM
sub lpoke(addr as ulong,value as ulong)
asm
wrlong value, addr
end asm
end sub

' hub address of a variable
function addr(byref v as const any) as ulong
return(cast(ulong,@v))
end function

' read a long directly from hub RAM
function lpeek(addr as ulong) as ulong
dim r as ulong
asm
rdlong r,addr
end asm
return r
end function
The code loads 1 kB using the 3-element list and then directly, measuring the time of each.
I can either call the "cog1" procedure directly from the main cog, or make it run in its own cog.
If it runs from cog #0, the results are 5956 and 5474.
If it runs in another cog, the results are 7150 and 6795.
The difference between cogs is much too big - about 1300 clocks. Why?
That's strange. The only thing I can think of is that the COG order would add a few extra clocks if the polling loop is restarted per list element, but that is a very small effect compared to your numbers above. Try removing the unused COG IDs from being polled, by patching their QoS value as 0 instead of 112<<16, to see if that improves things...?
Actually, I think I know why now... if I'm reading your BASIC program correctly, it looks like you are using the same mailbox base address, just with a different burst size, in the two tested cases. Shouldn't you be using a different mailbox address for each COG?
Yes, that was the bug: the worker cog needs its own mailbox (getMailbox(cpuid()) rather than getMailbox(0)). Now it is 5956, 5517.
That is roughly 250 clocks of overhead for every list element. I will try to test a 15-element, 1024-byte list... 15 elements is the maximum for 8 windows on the screen: 7 overlapping windows can cut the background scanline into at most 2*7+1 = 15 spans.
Good. I think the 16-bit PSRAM will be useful in this application. Even if each list element adds ~1 us of overhead, 1024 x 576 8bpp might still be doable, given your line scanning frequency is reasonably low. Your horizontal timing allows something like 28 us or so per scan line, doesn't it?
By the way, I wonder if 960x540 might be nicer to get working than 1024x576, as it could scale very cleanly on a full HD monitor, with fewer artifacts.
Everything below 1024x576 can be done. The full beta 0.50 driver, without sprites, is (or should be) configurable. This PSRAM sprite 0.08 thing is also configurable, but in its current state it has hardcoded constants. They should and will be replaced with defined timing values; for now it is still at the experimental stage. There was no room left for sprites in 0.50, and that's why I made this new, simplified fork, which is PSRAM-only and graphics-only, without character modes and without any horizontal hardware-assisted zoom.
' bf.hs, hs, bf.vis, visible, up p., vsync, down p., cpl, total lines, clock, hubset, scanlines, ud bord, mode, reserved
timings long 14, 80, 14, 1024, 6, 8, 6, 128, 576, 336956522, %1_101101__11_0000_0110__1111_1011, 576, 0, 192, 0, 0
A test 15-element list takes 8577 clocks - about 25.5 us at the 336.96 MHz clock, so it still fits in the line, but there is not much time left. However, not every line will have a 15-element list; this is the worst case for 8 windows (including the background).
I tried to decode the driver, but it contains way too many skips and too much self-modifying code for me to understand what happens there.
@pik33 said:
A test 15-element list takes 8577 clocks - about 25.5 us at the 336.96 MHz clock, so it still fits in the line, but there is not much time left. However, not every line will have a 15-element list; this is the worst case for 8 windows (including the background).
That still sounds good then. And 16-bit PSRAM will be even better.
I tried to decode the driver, but it contains way too many skips and too much self-modifying code for me to understand what happens there.
Yeah, to pack in all the features it needs a lot of skipf and other tricks.
To speed up list processing with the LOCKED attribute enabled, it might be possible to do something like this...
Now, I've not checked whether this works, or whether there are nasty side effects, but if it does, it could save quite a lot of cycles per list element (potentially up to ~86 clock cycles per list item, depending on COG poll order, HUB read latency, etc.), because it skips the repolling and the per-COG setup code and gets right back to list processing directly. It requires 4 free LUT RAM longs, which hopefully might still be available in the 4-bit PSRAM driver.
To try it (no guarantees), replace this block:
checknext tjz addr2, #listcomplete 'not special, check if the list is complete
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'exit if it has
skipf #%1000 'do not notify if list is continuing
wrlong addr2, ptrb ' a write back next list address
listcomplete altd id, #id0 ' a compute COG's state address
bitl 0-0, #LIST_BIT ' a clear list flag for this COG
_ret_ push #notify ' | we are done with the list
_ret_ push #poller ' a we are still continuing the list
with this:
checknext tjz addr2, #listcomplete 'not special, check if the list is complete
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'exit if it has
testb id, #LOCKED_BIT wz
if_z mov hubdata, addr2
if_z skipf #%110001
if_nz skipf #%1000 'do not notify if list is continuing
wrlong addr2, ptrb ' a write back next list address
listcomplete altd id, #id0 ' a compute COG's state address
bitl 0-0, #LIST_BIT ' a clear list flag for this COG
_ret_ push #notify ' | we are done with the list
_ret_ push #poller ' a we are still continuing the list
_ret_ push #real_list
Now it is 7300..7600 (it varies from run to run): about 1000 clocks gained, and this can make an 8-window system possible.
On the EC-32MB this may also allow 32-bit color graphics instead of 8-bit. Maybe I will receive the EC-32MB tomorrow.
The anticipation! But you won't have anything to plug the Edge Card into.
@pik33 said:
Now it is 7300..7600 (it varies from run to run): about 1000 clocks gained, and this can make an 8-window system possible.
On the EC-32MB this may also allow 32-bit color graphics instead of 8-bit. Maybe I will receive the EC-32MB tomorrow.
That's great! I'm glad it works and saves a decent number of clocks. I think there is some work to do on the 16-bit driver to make room (I have 2 instructions free there and need 4). But there might be some scope, as I have an idea...
Why? I have several 1st generation Edges and breakout boards with the edge connector.
Maybe something non-essential can be removed from there. What makes the 16-bit driver longer?
It's far more complex to manage, due to the addressing and spanning 4 chips versus a single chip. I need to check for, and implement, read-modify-write on all writes that don't align to the native word size.
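For example (a sketch only, with hypothetical helper routines, not the actual driver code), a lone byte write becomes:
' read-modify-write of a byte within the 16-bit-wide array (sketch)
        call    #psram_read_word        ' read the containing 16-bit word into wordval
        setbyte wordval, newbyte, #0    ' merge the new byte (the byte select depends on the address)
        call    #psram_write_word       ' write the whole word back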
I think there is some slight duplication of the list-termination processing elsewhere in LUT RAM, in this block here, which may be able to share common code, and that may free up the extra LUT longs needed... this code is also in the HyperRAM driver, IIRC, so hopefully this list processing change could make it into all my drivers, not just the PSRAM ones.
if_nz jmp #moretransfers ' a b c more transfers still to go
tjz link, #listcomplete ' a b c test link for next request
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'will exit if it has
wrlong link, ptrb 'setup list next pointer
altd id, #id0 'compute COG's state address
bitl 0-0, #LIST_BIT 'clear list flag for this COG
_ret_ push #poller 'we will return to polling
moretransfers getbyte request, addr1, #3 'prepare next request
TBD, needs more checking. I'm still somewhat wary of side effects from this change, as I used the Z flag (hopefully its original state is not required after this test).
Good. Not having any EC32MB, I assumed you had none at all.
I am at the university now: my wife called and told me she received the board.
I have Edges without the SD slot and RAM. I even tried to use with them the same 4-bit PSRAM contraption I made for the Eval, with no success: too-long or too-complex track shapes, or maybe something else, limit the clock speed badly.
Now I have an idea to try again: I can still use clock/4, and this can make 16-bit work with both boards at half speed. That is still 2x the bandwidth of one chip at 4 bits (16 bits at sysclk/4 transfer twice as much per clock as 4 bits at sysclk/2).
Hurrah! Just in time.
@pik33
Not 100% sure yet, but for the 16-bit PSRAM driver, to enable LOCKED list processing as was done for the 4-bit version, it might be possible to free 6 longs by replacing this code:
if_nz jmp #moretransfers ' a b c more transfers still to go
tjz link, #listcomplete ' a b c test link for next request
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'will exit if it has
wrlong link, ptrb 'setup list next pointer
altd id, #id0 'compute COG's state address
bitl 0-0, #LIST_BIT 'clear list flag for this COG
_ret_ push #poller 'we will return to polling
moretransfers getbyte request, addr1, #3 'prepare next request
with this code, in addition to the same LOCKED list change I provided earlier, which is also required:
if_z mov addr2, link
if_z jmp #checknext
moretransfers getbyte request, addr1, #3 'prepare next request
You should also do the same to the 4-bit version, as there is a list processing path for graphics copies through there as well.
If you get a chance tonight to try that out in your new 16-bit PSRAM setup, let me know if it works. I've looked for safe Z flag usage, and so far re-using it seems okay, but the code gets quite squirrely in that area (yeah, that's a technical term I first learned in San Jose, LOL), so I might have missed a pathway through it.
I applied this patch to the 4-bit driver and the windows test still works.
I don't use the driver's graphics copies; I do that in the video driver.