@pik33 said:
It can be a good idea to make the window decorations out of sprites.
Tiling (I should have said tiles rather than partitions earlier) doesn't need border decorations. It won't be like a desktop GUI, but is that really desired?
The current version of Prop2Play uses this kind of pseudowindow: not movable, simply decorated. There are 3 different pixel data sources: one for the scrolling info at the bottom, a second for the file info window, and the main buffer for the rest. As it is now, the file info window is blitted to the main buffer every frame; this takes about 5 ms, so I can still fit everything I need into the 20 ms frame.
The updated driver allows switching to the scrolling window without blitting.
And yes, I want a desktop-like GUI.
I moved the repository to GitLab:
https://gitlab.com/pik33/P2-retromachine
A bug cost me several days to catch.
The symptom: the first line of the picture was bad. The first 56 to 80 pixels were a copy of the 4th line counting from the bottom, including any sprite that was there.
The problem was the streamer's FIFO.
The driver has a 4-line buffer in the hub, into which it preloads data from the PSRAM; the streamer then reads it in RDFAST+LUT mode. When the streamer displayed the last line of the frame, the FIFO prefetched several bytes past the end, wrapping to the start of the buffer, which at that moment still held a stale line (the 4th line from the bottom): nothing new had been preloaded there, because there were no more lines to preload.
Then, at the start of the new frame, the buffer was preloaded with the first line, but the streamer ignored it, because it already had these longs in its FIFO.
The solution: re-init the RDFAST every frame after preloading the first line, rather than at the start of the frame, so the FIFO prefetches the proper pixels.
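In sketch form (hypothetical labels, not the actual driver code), the fixed per-frame order is:
frame_start
        call    #preload_line0          ' first fill the hub buffer with line 0 from the PSRAM
        rdfast  #0, ##linebuf           ' only then (re)start the FIFO, so it prefetches valid pixels
        ' ... start streaming the new frame ...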
Yes, obvious in hindsight: the FIFO is a prefetcher when in RDFAST mode. If the RAM has changed, usually from another cog/streamer, then the FIFO content becomes stale.
WRFAST FIFO operation is much more aggressive with its writes to hubRAM: it writes any fresh data ASAP. This is good in one sense but excessively consumes hubRAM bandwidth, and since the FIFO has priority, this impacts its cog's own accesses to hubRAM. It took a lot of testing to work this out - on ManAtWork's prompting, I think.
A first window test. The window is a BASIC class with its own graphics procedures, so it has its own independent PSRAM framebuffer, size, etc., including the cog which does the drawing. The video driver can place the window at any position on the screen using a display list and a PSRAM access list. To have more windows, I now have to write a manager which can generate these lists on the fly from the window positions.
While the class mostly works, there is a bug somewhere in the PSRAM access: the window interferes with the background.
It seems all the basic graphics stuff in the class works now, so it's time to start the manager.
This is pretty good stuff. I wonder if there will be sufficient time on the line for many overlapping windows, or just a very small number of them. Each time the request list moves to a new item, it loses some read bandwidth, because it needs to start a new transfer, which incurs overhead.
In this case it's important to process the request list in one run until it ends, without checking other cogs and mailboxes, to save as much time as possible. I have the driver open now, but it is a complex piece of code, so I am still searching and trying to understand what happens there.
Ideally, if you mark a COG's QoS attribute as LOCKED, it could complete the request list before yielding, but I'm not sure that is the current behaviour. I created request lists mainly for client writer COGs, which are typically low priority, not really for video reader COGs, so I don't think I made it work that way. It could potentially be modified to do so, however.
The test results are strange. This is the test code:
dim psram as class using "psram4"
dim s(512) ' stack for the worker cog
psram.startx(0, 0, 12, -1) ' start the PSRAM driver
let mbox=psram.getMailbox(0) ' use mailbox #0

' Build a 3-element request list. Each element is 4 longs:
' (0) request: $B0... (read burst) plus the PSRAM address
' (1) hub address, (2) byte count, (3) address of the next element, 0 = end of list
dim list1(11)
list1(0)=$B0000000+$600000 ' element 1: read from PSRAM $600000
list1(1)=$60000 ' into hub $60000
list1(2)=127 ' 127 bytes
list1(3)=addr(list1(4))
list1(4)=$B0_600000+127 ' element 2: continue where element 1 ended
list1(5)=$60000+127
list1(6)=511 ' 511 bytes
list1(7)=addr(list1(8))
list1(8)=$B0600000+511+127 ' element 3: the rest
list1(9)=$60000+511+127
list1(10)=1024-511-127 ' 386 bytes, 1024 in total
list1(11)=0 ' end of the list

'cog1 '<- use this call instead to run in cog #0
let cog=cpu(cog1,@s) '<- runs the test in its own cog

sub cog1
for thecog=0 to 7:psram.setQoS(thecog, 112 << 16):next thecog ' default QoS for all cogs
psram.setQoS(cpuid(), $7FFFF400) ' raised QoS for this cog
waitms(1000)
' test 1: load 1 kB via the 3-element request list
lpoke mbox+4,addr(list1) ' list address
lpoke mbox,$FFFFFFFF ' request: process the list
let t1=getct()
do : loop until (lpeek(mbox) and $80000000) = 0 ' wait until the driver clears the busy bit
let t1=getct()-t1: print t1
' test 2: load the same 1 kB with a single direct request
lpoke mbox+8,1024 ' byte count
lpoke mbox+4,$60000 ' hub address
lpoke mbox,$B060_0000 ' request: read burst from PSRAM $600000
let t1=getct()
do : loop until (lpeek(mbox) and $80000000) = 0
let t1=getct()-t1: print t1
end sub

' write a long directly to hub RAM
sub lpoke(addr as ulong,value as ulong)
asm
wrlong value, addr
end asm
end sub

' hub address of a variable
function addr(byref v as const any) as ulong
return(cast(ulong,@v))
end function

' read a long directly from hub RAM
function lpeek(addr as ulong) as ulong
dim r as ulong
asm
rdlong r,addr
end asm
return r
end function
The code loads 1 kB using the 3-element list and then directly, measuring the time of each.
I can either call the "cog1" procedure directly from the main cog, or make it run in its own cog.
If it runs from cog #0, the results are 5956 and 5474.
If it runs in another cog, the results are 7150 and 6795.
The difference between cogs is much too big - about 1300 clocks. Why?
That's strange. The only thing I can think of is that the COG order would add a few extra clocks if the polling loop is restarted per list element, but that is a very small effect compared to your numbers above. Try removing the unused COG IDs from being polled, by patching their QoS value as 0 instead of 112<<16, to see if that improves things...?
Actually, I think I know why now... if I'm reading your BASIC program correctly, it looks like you are using the same mailbox base address, just with a different burst size, in the two tested cases. Shouldn't you be using a different mailbox address for each COG?
Yes, that was the bug: the worker cog needs its own mailbox (getMailbox(cpuid()) rather than getMailbox(0)). Now it is 5956, 5517.
That is roughly 250 clocks of overhead for every list element. I will try to test a 15-element, 1024-byte list... 15 elements is the maximum for 8 windows on the screen: 7 overlapping windows can cut the background scanline into at most 2*7+1 = 15 spans.
Good. I think the 16-bit PSRAM will be useful in this application. Even if each list element adds ~1 us of overhead, 1024 x 576 8bpp might still be doable, given your line scanning frequency is reasonably low. Your horizontal timing allows something like 28 us or so per scan line, doesn't it?
By the way, I wonder if 960x540 might be nicer to get working than 1024x576, as it could scale very cleanly on a full HD monitor, with fewer artifacts.
Everything below 1024x576 can be done. The full beta 0.50 driver, without sprites, is (or should be) configurable. This PSRAM sprite 0.08 thing is also configurable, but in its current state it has hardcoded constants. They should and will be replaced with defined timing values; for now it is still at the experimental stage. There was no room left for sprites in 0.50, and that's why I made this new, simplified fork, which is PSRAM-only and graphics-only, without character modes and without any horizontal hardware-assisted zoom.
' bf.hs, hs, bf.vis, visible, up p., vsync, down p., cpl, total lines, clock, hubset, scanlines, ud bord, mode, reserved
timings long 14, 80, 14, 1024, 6, 8, 6, 128, 576, 336956522, %1_101101__11_0000_0110__1111_1011, 576, 0, 192, 0, 0
A test 15-element list takes 8577 clocks - about 25.5 us at the 336.96 MHz clock, so it still fits in the line, but there is not much time left. However, not every line will have a 15-element list; this is the worst case for 8 windows (including the background).
I tried to decode the driver, but it contains way too many skips and too much self-modifying code for me to understand what happens there.
@pik33 said:
A test 15-element list takes 8577 clocks - about 25.5 us at the 336.96 MHz clock, so it still fits in the line, but there is not much time left. However, not every line will have a 15-element list; this is the worst case for 8 windows (including the background).
That still sounds good then. And 16-bit PSRAM will be even better.
I tried to decode the driver, but it contains way too many skips and too much self-modifying code for me to understand what happens there.
Yeah, to pack in all the features it needs a lot of skipf and other tricks.
To speed up list processing with the LOCKED attribute enabled, it might be possible to do something like this...
Now, I've not checked whether this works, or whether there are nasty side effects, but if it does, it could save quite a lot of cycles per list element (potentially up to ~86 clock cycles per list item, depending on COG poll order, HUB read latency, etc.), because it skips the repolling and the per-COG setup code and gets right back to list processing directly. It requires 4 free LUT RAM longs, which hopefully might still be available in the 4-bit PSRAM driver.
To try it (no guarantees), replace this block:
checknext tjz addr2, #listcomplete 'not special, check if the list is complete
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'exit if it has
skipf #%1000 'do not notify if list is continuing
wrlong addr2, ptrb ' a write back next list address
listcomplete altd id, #id0 ' a compute COG's state address
bitl 0-0, #LIST_BIT ' a clear list flag for this COG
_ret_ push #notify ' | we are done with the list
_ret_ push #poller ' a we are still continuing the list
with this:
checknext tjz addr2, #listcomplete 'not special, check if the list is complete
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'exit if it has
testb id, #LOCKED_BIT wz
if_z mov hubdata, addr2
if_z skipf #%110001
if_nz skipf #%1000 'do not notify if list is continuing
wrlong addr2, ptrb ' a write back next list address
listcomplete altd id, #id0 ' a compute COG's state address
bitl 0-0, #LIST_BIT ' a clear list flag for this COG
_ret_ push #notify ' | we are done with the list
_ret_ push #poller ' a we are still continuing the list
_ret_ push #real_list
Now it is 7300..7600 (it varies from run to run): about 1000 clocks gained, and this can make an 8-window system possible.
On the EC-32MB this may also allow 32-bit color graphics instead of 8-bit. Maybe I will receive the EC-32MB tomorrow.
The anticipation! But you won't have anything to plug the Edge Card into.
@pik33 said:
Now it is 7300..7600 (it varies from run to run): about 1000 clocks gained, and this can make an 8-window system possible.
On the EC-32MB this may also allow 32-bit color graphics instead of 8-bit. Maybe I will receive the EC-32MB tomorrow.
That's great! I'm glad it works and saves a decent number of clocks. I think there is some work to do on the 16-bit driver to make room (I have 2 instructions free there and need 4). But there might be some scope, as I have an idea...
Why? I have several 1st generation Edges and breakout boards with the edge connector.
Maybe something non-essential can be removed from there. What makes the 16-bit driver longer?
It's far more complex to manage, due to the addressing and spanning 4 chips versus a single chip. I need to check for, and implement, read-modify-write on all writes that don't align to the native word size.
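For example (a sketch only, with hypothetical helper routines, not the actual driver code), a lone byte write becomes:
' read-modify-write of a byte within the 16-bit-wide array (sketch)
        call    #psram_read_word        ' read the containing 16-bit word into wordval
        setbyte wordval, newbyte, #0    ' merge the new byte (the byte select depends on the address)
        call    #psram_write_word       ' write the whole word back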
I think there is some slight duplication of the list-termination processing elsewhere in LUT RAM, in this block here, which may be able to share common code, and that may free up the extra LUT longs needed... this code is also in the HyperRAM driver, IIRC, so hopefully this list processing change could make it into all my drivers, not just the PSRAM ones.
if_nz jmp #moretransfers ' a b c more transfers still to go
tjz link, #listcomplete ' a b c test link for next request
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'will exit if it has
wrlong link, ptrb 'setup list next pointer
altd id, #id0 'compute COG's state address
bitl 0-0, #LIST_BIT 'clear list flag for this COG
_ret_ push #poller 'we will return to polling
moretransfers getbyte request, addr1, #3 'prepare next request
TBD, needs more checking. I'm still somewhat wary of side effects from this change, as I used the Z flag (hopefully its original state is not required after this test).
Good. Not having any EC32MB, I assumed you had none at all.
I am at the university now: my wife called and told me she received the board.
I have Edges without the SD slot and RAM. I even tried to use with them the same 4-bit PSRAM contraption I made for the Eval, with no success: too-long or too-complex track shapes, or maybe something else, limit the clock speed badly.
Now I have an idea to try again: I can still use clock/4, and this can make 16-bit work with both boards at half speed. That is still 2x the bandwidth of one chip at 4 bits (16 bits at sysclk/4 transfer twice as much per clock as 4 bits at sysclk/2).
Hurrah! Just in time.
@pik33
Not 100% sure yet, but for the 16-bit PSRAM driver, to enable LOCKED list processing as was done for the 4-bit version, it might be possible to free 6 longs by replacing this code:
if_nz jmp #moretransfers ' a b c more transfers still to go
tjz link, #listcomplete ' a b c test link for next request
rdlong pa, ptrb[-1] 'check if list has been aborted by client
tjns pa, #listcomplete 'will exit if it has
wrlong link, ptrb 'setup list next pointer
altd id, #id0 'compute COG's state address
bitl 0-0, #LIST_BIT 'clear list flag for this COG
_ret_ push #poller 'we will return to polling
moretransfers getbyte request, addr1, #3 'prepare next request
with this code, in addition to the same LOCKED list change I provided earlier, which is also required:
if_z mov addr2, link
if_z jmp #checknext
moretransfers getbyte request, addr1, #3 'prepare next request
You should also do the same to the 4-bit version, as there is a list processing path for graphics copies through there as well.
If you get a chance tonight to try that out in your new 16-bit PSRAM setup, let me know if it works. I've looked for safe Z flag usage, and so far re-using it seems okay, but the code gets quite squirrely in that area (yeah, that's a technical term I first learned in San Jose, LOL), so I might have missed a pathway through it.
I applied this patch to the 4-bit driver and the windows test still works.
I don't use the driver's graphics copies; I do that in the video driver.