A display list is fun :) [0.50a beta+0.07g - PSRAM command list used] - Page 4 — Parallax Forums


  • pik33 Posts: 2,407

    @evanh said:

    @pik33 said:
    It can be a good idea to do window decoration out of sprites.

    Tiling (I should have said tiles rather than partitions earlier) doesn't need border decorations. It won't be like a desktop GUI, but is that really desired?

    The current version of Prop2Play uses this kind of pseudowindow: not movable, simply decorated. There are 3 different pixel data sources: one for the scrolling info at the bottom, a second for the file info window, and the main buffer for the rest. As it is now, the file info window is blitted every frame to the main buffer. This takes about 5 ms, so I can still fit everything I need into the 20 ms frame.
    The updated driver allows switching to the scrolling window without blitting.

    And yes, I want a desktop-like GUI :)

  • pik33 Posts: 2,407
    edited 2022-05-15 11:33

    I moved the repository to gitlab.

  • pik33 Posts: 2,407
    edited 2022-05-24 07:31

    A bug cost me several days to catch.

    The symptom: the first line of the picture was bad. Its first 56 to 80 pixels were a copy of the 4th line counting from the bottom, including any sprite that was there.

    The problem was the streamer's FIFO.

    The driver has a 4-line buffer in the hub, into which it preloads data from the PSRAM. The streamer then reads this buffer in RDFAST+LUT mode. When the streamer displays the last line, the FIFO prefetches several bytes from the start of the buffer, which at that moment still contains the line 3 lines before the last one: nothing new has been preloaded there because there are no more lines to preload.

    Then at the start of the new frame the buffer is preloaded with the first line, but the streamer ignores this, because it already has these longs in the FIFO.

    The solution: init the RDFAST at every frame after preloading the first line, and not at the start, so that it can preload the proper pixels into its FIFO.
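
    The stale-prefetch effect described above can be reproduced with a toy model on a PC. This is only a sketch, not the driver code: the class `PrefetchFifo`, the FIFO depth of 3 and the 8-pixel line length are made-up values chosen to keep the output short, and the real P2 FIFO is only assumed to behave like a read-ahead queue that never re-reads hub RAM.

```python
from collections import deque

LINE = 8    # pixels per line (made-up, short for readability)
SLOTS = 4   # the hub buffer holds four lines
DEPTH = 3   # hypothetical number of entries the FIFO reads ahead


class PrefetchFifo:
    """Toy read-ahead FIFO: copies memory early and never re-reads it."""

    def __init__(self, memory, depth):
        self.memory = memory
        self.depth = depth
        self.queue = deque()
        self.next_addr = 0

    def start(self, addr):
        """Model of re-issuing the fast read: drop old data, prefetch again."""
        self.queue.clear()
        self.next_addr = addr
        self._fill()

    def _fill(self):
        while len(self.queue) < self.depth:
            # The buffer is circular, so the read address wraps around
            self.queue.append(self.memory[self.next_addr % len(self.memory)])
            self.next_addr += 1

    def read(self):
        value = self.queue.popleft()
        self._fill()
        return value


def show_first_line(restart_after_preload):
    buf = [None] * (LINE * SLOTS)
    fifo = PrefetchFifo(buf, DEPTH)
    # End of the previous frame: the slots hold the last four lines
    for slot in range(SLOTS):
        for x in range(LINE):
            buf[slot * LINE + x] = f"old{slot}"
    fifo.start(0)
    for _ in range(LINE * SLOTS):
        fifo.read()          # after the last line the FIFO has wrapped to slot 0
    # New frame: the first line is preloaded into slot 0
    for x in range(LINE):
        buf[x] = "new0"
    if restart_after_preload:
        fifo.start(0)        # the fix: restart the fast read after the preload
    return [fifo.read() for _ in range(LINE)]


print(show_first_line(False))  # starts with stale "old0" pixels
print(show_first_line(True))   # all "new0"
```

    Without the restart, the first `DEPTH` pixels of the new frame come from `old0`, which is the 4th line from the bottom of the previous frame, matching the symptom.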

  • evanh Posts: 16,282

    Yes, obvious in hindsight, FIFO is a pre-fetcher when in RDFAST mode. If the RAM has changed, usually from another cog/streamer, then the FIFO content becomes dirty.

    WRFAST FIFO operation is much more aggressive with its writes to hubRAM. It will write any fresh data ASAP. This is good in one sense but excessively consumes its hubRAM bandwidth. Since the FIFO has priority, this impacts its cog's accesses to hubRAM. It took a lot of testing to work this out - On ManAtWork's prompting I think it was.

  • pik33 Posts: 2,407
    edited 2022-05-24 18:39

    A first window test. The window is a Basic class with its own graphic procedures, so it has its own independent PSRAM framebuffer, size, etc., including the cog which does the drawing. The video driver can place the window at any position on the screen using a display list and a PSRAM access list. To have more of them, I now have to write a manager which can generate these lists on the fly based on the window positions.
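
    As a rough illustration of what such a manager has to compute, here is a sketch that splits one scanline into spans, each taken from the topmost window covering it. It is not the actual manager: the `Window` fields, the `spans_for_line` function and the tuple layout are all hypothetical, and the window `name` merely stands in for a PSRAM framebuffer address.

```python
from dataclasses import dataclass


@dataclass
class Window:
    x: int        # left edge on screen, in pixels
    y: int        # top edge on screen, in lines
    width: int
    height: int
    name: str     # stands in for the window's PSRAM framebuffer


def spans_for_line(windows, line, screen_width):
    """Return (name, start_x, length, source_x) spans for one scanline.

    `windows` is ordered back to front; windows[0] is the background.
    """
    active = [w for w in windows if w.y <= line < w.y + w.height]
    # Every x position where the visible window may change
    edges = {0, screen_width}
    for w in active:
        edges.add(max(0, min(screen_width, w.x)))
        edges.add(max(0, min(screen_width, w.x + w.width)))
    edges = sorted(edges)

    spans = []
    for left, right in zip(edges, edges[1:]):
        owner = None
        for w in active:                      # later windows are on top
            if w.x <= left and right <= w.x + w.width:
                owner = w
        if owner is None:
            continue                          # nothing covers this segment
        last = spans[-1] if spans else None
        if last and last[0] == owner.name and last[1] + last[2] == left:
            # Same window continues: extend the previous span
            spans[-1] = (last[0], last[1], last[2] + right - left, last[3])
        else:
            spans.append((owner.name, left, right - left, left - owner.x))
    return spans


# A background plus seven nested windows, each smaller than the one below it
demo = [Window(0, 0, 1024, 576, "bg")] + [
    Window(10 * i, 10 * i, 1024 - 20 * i, 576 - 20 * i, f"w{i}")
    for i in range(1, 8)
]
print(len(spans_for_line(demo, 100, 1024)))  # 15
print(spans_for_line(demo, 5, 1024))         # [('bg', 0, 1024, 0)]
```

    With a background and seven nested windows, a line that crosses all of them needs 15 spans, while a line above the first window needs only one.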

    While the class mostly works, there is a bug somewhere in the PSRAM access: the window interferes with the background.

  • pik33 Posts: 2,407

    It seems now all basic graphic stuff works in the class, so it's time to start the manager.

  • This is pretty good stuff. I wonder if there will be sufficient time on the line for many overlapping windows, or just a very small number of them. Each time the request list runs from a new item, it will lose out some read bandwidth as it needs to start a new transfer which incurs overhead.

  • pik33 Posts: 2,407

    In this case it's important to process the request list in one run until it ends, without checking other cogs and mailboxes, to save as much time as possible. I have the driver open now, but it is a complex piece of code, so I am still searching and trying to understand what happens there.

  • Ideally if you mark a COG's QoS attribute as LOCKED it could complete the request list before yielding, but I'm not sure if that is the current behaviour because I created request lists mainly for client writer COGs which are typically low priority, not really for video reader COGs, so I don't think I made it work that way. It could potentially be modified to do so however.

  • pik33 Posts: 2,407

    The test results are strange.

    This is the test code:

    dim psram as class using "psram4"
    dim s(512)
    psram.startx(0, 0, 12, -1)
    let mbox=psram.getMailbox(0)
    dim list1(11)
    'cog1                                                '<-uncomment to run in the cog #0
    let cog=cpu(cog1,@s)                   '<-uncomment to run in its own cog 
    sub cog1
    for thecog=0 to 7:psram.setQos(thecog, 112 << 16) :next thecog
    psram.setQoS(cpuid(), $7FFFF400) 
    lpoke mbox+4,addr(list1)
    lpoke mbox,$FFFFFFFF
    let t1=getct()
    do : loop until (lpeek(mbox) and $80000000) = 0
    let t1=getct()-t1: print t1
    lpoke mbox+8,1024
    lpoke mbox+4,$60000
    lpoke mbox,$B060_0000
    let t1=getct()
    do : loop until (lpeek(mbox) and $80000000) = 0
    let t1=getct()-t1: print t1
    end sub
    sub lpoke(addr as ulong,value as ulong)
    asm
    wrlong value, addr
    end asm
    end sub
    function addr(byref v as const any) as ulong
    end function
    function lpeek(addr) as ulong
    dim r as ulong
    asm
    rdlong r,addr
    end asm
    return r
    end function

    The code loads 1 kB twice, first using a 3-element list and then directly, and measures the time of each transfer.
    I can do it either with the main cog calling the "cog1" procedure directly, or make the procedure run in its own cog.

    If it runs from the cog #0, the results are 5956, 5474
    If it runs using another cog, the results are 7150, 6795

    The difference between the cogs is much too big: about 1200 to 1300 clocks. Why?

  • That's strange. The only thing I can think of is that the COG order would add a few extra clocks if the polling loop is restarted per list element but this is a very small effect compared to your numbers above. Try removing the unused COG IDs from being polled by patching the value as 0 for the QoS value, instead of 112<<16 to see if that improves things...?

  • Actually I think I know why now... if I'm reading your BASIC program correctly it looks like you are using the same mailbox base address but just with a different burst size in the two tested cases. Shouldn't you be using a different mailbox address for each COG?

  • pik33 Posts: 2,407

    Yes, that was the bug. Now it is 5956, 5517.
    It is something like 250 clocks overhead for every list element. I will try to test a 15-element, 1024 bytes list... 15 elements is the maximum for 8 windows on the screen
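
    The per-element overhead can be estimated from the two measurements above. This is only a back-of-the-envelope sketch: it assumes the direct request costs about as much as a one-element list and that the overhead grows linearly with the number of elements, neither of which has been verified here.

```python
list_clocks = 5956     # measured: 1 kB through a 3-element list
direct_clocks = 5517   # measured: 1 kB as one direct request

# Assumption: the direct request behaves like a list with a single element,
# so the 3-element list pays for two extra elements.
extra_elements = 3 - 1
overhead = (list_clocks - direct_clocks) / extra_elements


def estimate(elements):
    """Linear estimate of the clocks needed for 1 kB split into `elements` parts."""
    return direct_clocks + (elements - 1) * overhead


print(overhead)      # 219.5
print(estimate(15))  # 8590.0
```

    Under these assumptions the overhead comes out a little below the 250 clocks guessed above, and a 15-element list would need roughly 8600 clocks.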

  • Good. I think the 16bit PSRAM will be useful in this application. Even if each list element adds ~1us overhead, 1024 x 576 8bpp might still be doable given your line scanning frequency is reasonably small. Your horizontal timing allows something like 28us or so per scan line doesn't it?

  • By the way, I wonder if 960x540 might be nice to get working instead of 1024x576, as it could scale very cleanly on a full HD monitor with fewer artifacts.

  • pik33 Posts: 2,407
    edited 2022-05-25 12:01

    Anything less than 1024x576 can be done. The full beta 0.50 driver without sprites is (or: should be) configurable. This PSRAM sprite 0.08 thing is also configurable, but in its current state it has hardcoded constants. They should and will be replaced with defined timing values; for now it is still in the experimental stage. There was no space left for sprites in 0.50, which is why I made this new, simplified fork: PSRAM only, graphics only, without character modes and without any hardware-assisted horizontal zoom.

    '                      bf.hs, hs,  bf.vis  visible, up p., vsync, down p.,  cpl, total lines, clock,       hubset                                scanlines  ud bord mode reserved
    timings         long   14,    80,  14,      1024,   6,     8,     6,        128, 576,         336956522,   %1_101101__11_0000_0110__1111_1011,   576,        0,     192, 0, 0
  • pik33 Posts: 2,407
    edited 2022-05-25 13:19

    The time for a test 15-element list is 8577 clocks. It still fits in the line, but there is not much time left. However, not every line will have a 15-element list; this is the worst case for 8 windows (including the background).
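
    A quick check of the line budget, as a sketch only. The system clock is taken from the "clock" column of the timings line above, the 28 us line time is the rough estimate quoted earlier in the thread rather than a measured value, and the test is assumed to have run at that same clock.

```python
SYSCLK_HZ = 336_956_522   # "clock" column of the timings line
LINE_TIME_S = 28e-6       # rough per-line estimate from the thread (assumption)


def line_budget_clocks(line_time_s=LINE_TIME_S, sysclk_hz=SYSCLK_HZ):
    """Number of whole system clocks available during one scan line."""
    return int(line_time_s * sysclk_hz)


budget = line_budget_clocks()
worst_case = 8577          # measured clocks for the 15-element list
print(budget, budget - worst_case)  # 9434 857
```

    With these numbers the worst-case list leaves under 1000 clocks of margin per line.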

    I tried to decode the driver, but it contains way too many skips and too much self-modifying code for me to understand what happens there.

  • @pik33 said:
    The time for a test 15-element list is 8577 clocks. It still fits in the line, but there is not much time left. However, not every line will have a 15-element list; this is the worst case for 8 windows (including the background).

    That still sounds good then. And 16 bit PSRAM will be even better.

    I tried to decode the driver, but it contains way too many skips and too much self-modifying code for me to understand what happens there.

    Yeah to pack in all the features, it needs a lot of skipf and other tricks.

  • rogloh Posts: 5,882
    edited 2022-05-26 05:43

    To speed up the list process with the LOCKED attribute enabled it might be possible to do something like this...

    Now, I've not checked if this works or if there are nasty side effects, but if it does, it could save quite a lot of cycles per list element (potentially up to ~86 clock cycles per list item, depending on COG poll order, HUB read latency, etc.) because it skips repolling and the per-COG setup code and gets right back to list processing directly. It requires 4 free LUT RAM longs, which hopefully might still be available in the 4 bit PSRAM driver.

    To try it (no guarantees), replace this block:

    checknext                   tjz     addr2, #listcomplete    'not special, check if the list is complete
                                rdlong  pa, ptrb[-1]            'check if list has been aborted by client
                                tjns    pa, #listcomplete       'exit if it has
                                skipf   #%1000                  'do not notify if list is continuing
                                wrlong  addr2, ptrb             ' a  write back next list address
    listcomplete                altd    id, #id0                ' a  compute COG's state address
                                bitl    0-0, #LIST_BIT          ' a  clear list flag for this COG
                _ret_           push    #notify                 ' |  we are done with the list
                _ret_           push    #poller                 ' a  we are still continuing the list

    with this:

    checknext                   tjz     addr2, #listcomplete    'not special, check if the list is complete
                                rdlong  pa, ptrb[-1]            'check if list has been aborted by client
                                tjns    pa, #listcomplete       'exit if it has
                                testb   id, #LOCKED_BIT wz
                if_z            mov     hubdata, addr2
                if_z            skipf   #%110001
                if_nz           skipf   #%1000                  'do not notify if list is continuing
                                wrlong  addr2, ptrb             ' a  write back next list address
    listcomplete                altd    id, #id0                ' a  compute COG's state address
                                bitl    0-0, #LIST_BIT          ' a  clear list flag for this COG
                _ret_           push    #notify                 ' |  we are done with the list
                _ret_           push    #poller                 ' a  we are still continuing the list
                _ret_           push    #real_list
  • pik33 Posts: 2,407

    Now it is 7300..7600 (it varies from run to run): about 1000 clocks gained, and this can make an 8-window system possible.
    On the EC-32MB this may also allow 32-bit color graphics instead of 8-bit. Maybe I will receive EC-32MB tomorrow.
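
    The gain per list element can be worked out from the figures reported in this thread. This is a sketch that assumes the new numbers refer to the same 15-element, 1024-byte test as the earlier 8577-clock result.

```python
ELEMENTS = 15            # elements in the test list
before = 8577            # clocks before the LOCKED patch
after_range = (7300, 7600)  # clocks after the patch, varies from run to run

# Clocks saved per list element for the best and the worst observed run
saved = [round((before - after) / ELEMENTS, 1) for after in after_range]
print(saved)  # [85.1, 65.1]
```

    That is roughly 65 to 85 clocks per element, in line with the "up to ~86 clock cycles per list item" estimate given with the patch.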

  • evanh Posts: 16,282

    The anticipation! :) But you won't have anything to plug the Edge Card into. :(

  • @pik33 said:
    Now it is 7300..7600 (it varies from run to run): about 1000 clocks gained, and this can make an 8-window system possible.
    On the EC-32MB this may also allow 32-bit color graphics instead of 8-bit. Maybe I will receive EC-32MB tomorrow.

    That's great! I'm glad it works and saves a decent amount of clocks. I think there is some work to do on 16 bit driver to make room (I have 2 instructions free there and need 4). But there might be some scope as I have an idea...

  • pik33 Posts: 2,407
    edited 2022-05-26 08:10

    But you won't have anything to plug the Edge Card into

    Why? I have several 1st generation Edges and breakout boards with the edge connector.

  • pik33 Posts: 2,407

    (I have 2 instructions free there and need 4).

    Maybe something non-essential can be removed from there. What makes the 16-bit driver longer?

  • rogloh Posts: 5,882
    edited 2022-05-26 08:21

    @pik33 said:

    (I have 2 instructions free there and need 4).

    Maybe something non-essential can be removed from there. What makes the 16-bit driver longer?

    It's far more complex to manage due to the addressing and spanning the 4 chips vs single chip. I need to check for and implement Read/Modify/Write on all writes that don't align to the native word size.
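
    To illustrate why unaligned writes need Read/Modify/Write, here is a generic host-side sketch. It is not the driver's code: the 4-byte `WORD` size, the `write_bytes` helper and the use of a `bytearray` as the memory are all hypothetical, chosen only to show the idea that a memory which can only be written in whole native words must read, patch and write back any partially covered word.

```python
WORD = 4  # hypothetical native word size in bytes


def write_bytes(mem: bytearray, addr: int, data: bytes, word: int = WORD) -> int:
    """Write `data` at `addr` using whole-word accesses only.

    Returns the number of read/modify/write cycles that were needed.
    """
    rmw_cycles = 0
    end = addr + len(data)
    pos = addr
    while pos < end:
        base = pos - pos % word              # start of the word containing pos
        chunk_end = min(end, base + word)
        piece = data[pos - addr:chunk_end - addr]
        if pos == base and chunk_end == base + word:
            mem[base:base + word] = piece    # fully covered word: plain write
        else:
            current = bytearray(mem[base:base + word])   # read
            current[pos - base:chunk_end - base] = piece  # modify
            mem[base:base + word] = current               # write back
            rmw_cycles += 1
        pos = chunk_end
    return rmw_cycles


memory = bytearray(16)
cycles = write_bytes(memory, 3, b"ABCDEFG")
print(cycles, bytes(memory[3:10]))  # 2 b'ABCDEFG'
```

    In this example the first and the last word are only partly covered, so each needs a read/modify/write cycle, while the aligned word in the middle is written directly.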

    I think there is some slight duplication of the list termination processing in LUT RAM elsewhere, in this block here, that may be able to share common code, and this may free up the extra LUT longs needed. This code is also in the HyperRAM driver IIRC, so hopefully this list processing change could make it into all my drivers, not just PSRAM.

                if_nz           jmp     #moretransfers          ' a b c  more transfers still to go
                                tjz     link, #listcomplete     ' a b c  test link for next request
                                rdlong  pa, ptrb[-1]            'check if list has been aborted by client
                                tjns    pa, #listcomplete       'will exit if it has
                                wrlong  link, ptrb              'setup list next pointer
                                altd    id, #id0                'compute COG's state address
                                bitl    0-0, #LIST_BIT          'clear list flag for this COG
                _ret_           push    #poller                 'we will return to polling
    moretransfers               getbyte request, addr1, #3      'prepare next request

    TBD, needs more checking. I'm still somewhat wary of side effects from this change as I used the Z flag (hopefully its original state was not required after this test).

  • evanh Posts: 16,282

    @pik33 said:

    But you won't have anything to plug the Edge Card into

    Why? I have several 1st generation Edges and breakout boards with the edge connector.

    Good. Not having any EC32MB, I assumed you had none at all.

  • pik33 Posts: 2,407

    I am at the university now: my wife called and told me she received the board :)

    I have Edges without SD slot and RAM. I even tried to use the same 4-bit PSRAM contraption I made for Eval with them with no success: too long or too complex shape of tracks or maybe something else limits the clock speed badly.

    Now I have an idea to try again: I can still use clock/4, and this can make 16-bit work with both boards at half speed. That is still 2x more bandwidth than one chip with 4 bits.

  • VonSzarvas Posts: 3,551
    edited 2022-05-26 09:21

    @pik33 said:
    I am at the university now: my wife called and told me she received the board :)

    Hurrah! Just in time :smiley:

  • rogloh Posts: 5,882
    edited 2022-05-26 11:08

    Not 100% sure yet but for the 16 bit PSRAM driver to enable LOCKED list processing as was done for the 4 bit version it might be possible to free 6 longs by replacing this code:

                if_nz           jmp     #moretransfers          ' a b c  more transfers still to go
                                tjz     link, #listcomplete     ' a b c  test link for next request
                                rdlong  pa, ptrb[-1]            'check if list has been aborted by client
                                tjns    pa, #listcomplete       'will exit if it has
                                wrlong  link, ptrb              'setup list next pointer
                                altd    id, #id0                'compute COG's state address
                                bitl    0-0, #LIST_BIT          'clear list flag for this COG
                _ret_           push    #poller                 'we will return to polling
    moretransfers               getbyte request, addr1, #3      'prepare next request

    with this code, in addition to the same LOCKED list change I provided earlier which is also required:

                if_z            mov     addr2, link
                if_z            jmp     #checknext
    moretransfers               getbyte request, addr1, #3      'prepare next request

    You should also do the same to the 4 bit version too as there is a list processing path for graphics copies through there as well.

    If you get a chance tonight to try that out in your new 16 bit PSRAM setup, let me know if it works. I've looked for safe Z flag usage and so far re-using it seems okay, but the code gets quite squirrely in that area (yeah that's a technical term I first learned in San Jose, LOL) so I might have missed a pathway through it.

  • pik33 Posts: 2,407
    edited 2022-05-26 11:36

    I applied this patch to the 4-bit driver and the windows test still works :)
    I don't use the driver's graphic copies; I do them in the video driver.

