You can set the delay with the setDelay(delay) method in the wrapper driver API. Or you can customize the PSRAM defaults by messing about with the long array below. The first number in the array is the initial delay, used for frequencies below the first frequency value. It is followed by an ascending list of frequencies, and the delay increases by one each time the operating frequency reaches the next of them.
So in this case below:
Below 92MHz, delay=7
From 92-149MHz, delay=8
From 150-205MHz, delay=9
...etc
' delay profile
delayTable long 7,92_000000,150_000000,206_000000,258_000000,310_000000,333_000000,0
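A rough Python model of how such a profile could be walked, just to make the table easier to read. The values are copied from the delayTable line above; the function name and the inclusive threshold comparison are my assumptions (chosen to match "From 92-149MHz, delay=8"), not the driver's actual lookup code.

```python
# Delay profile copied from the delayTable above: the initial delay, then
# ascending frequency thresholds in Hz, terminated by 0.
DELAY_TABLE = [7, 92_000_000, 150_000_000, 206_000_000,
               258_000_000, 310_000_000, 333_000_000, 0]


def delay_for_frequency(freq_hz, table=DELAY_TABLE):
    """Return the input delay this profile would select for freq_hz.

    Assumption: a threshold counts as reached when freq_hz >= threshold.
    """
    delay = table[0]
    for threshold in table[1:]:
        if threshold == 0 or freq_hz < threshold:
            break  # end of table, or the next step has not been reached
        delay += 1  # one more delay step per threshold reached
    return delay


if __name__ == "__main__":
    for mhz in (80, 92, 200, 336, 354):
        print(mhz, "MHz -> delay", delay_for_frequency(mhz * 1_000_000))
```

With this profile anything from 333 MHz upwards lands on delay 13, which is the default that a manual setDelay(12) would override.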
Comments
Yeah, to exercise it you would need to include graphics copies in the request list; it won't be triggered otherwise. But it will be needed in the 16-bit variant (mainly to free the space required for the real locked list change).
P2-EC32 first run (still with a 4bit driver)
.... and it works without any glitches at 354 MHz using delay=12... Time to switch to 16-bit
(-- deleted, a stupid bug found)
16 bit works. Test times are 1465 for a 1k read and 4300..4600 for a 15-element 1k list read (instead of 5500/7300 on 4 bit). Minimum fully stable delay is 11 at 354 MHz.
... but this board itself is unstable at 354. The program crashes; maybe the P2 gets too hot there. A heatsink may be needed to work even at 336. I have no device to measure the temperature, but it is something like 45C when running a video test at 336. The prop2play will be more demanding: it uses up to 7 cogs.
32-bit PSRAM also allows using one of the standard VGA modes, where even FullHD can be done at a lower CPU clock.
336 MHz 8 cogs is definitely stable for P2EDGE (well, mine, anyways). Take note that it needs a good power supply. If power is insufficient that can make it crash under load. But try a heatsink (or an ice fairy), might give some insight.
That's cool, you got 16-bit locked lists working too. With DVI output using a 10:1 clock ratio, and 4600 clocks for a 1024 pixel list (15 entries), this means you'd still have 10240-4600 = 5640 clocks available per active scan line. That still leaves 5640/10240 or about 55% bandwidth for writes (or extra windows or a higher bit depth) and that is not even accounting for the extra time available in any blanking portions.
This should be quite good to see working with multiple overlapping windows on the screen being moved about. I guess to move a graphics window in your manager you need to de-link it from each request list and re-link it in the new position, according to a sorted x co-ordinate, on each scan line that the moved window covered before and covers after, and also adjust some pixel lengths of other neighbouring windows. Hopefully that process can be made very fast with suitable data structures vs simply blitting in memory with my driver's graphics moves. Ideally PASM would do this window move operation; SPIN2 could be too slow once there are lots of larger windows on screen.
I have the algorithm I wrote on RPi. It looks like this: windows divide the screen into rectangles. I have to make a rectangle list, then assign every rectangle to a window; on the RPi I then simply blitted every rectangle to the main frame buffer. Here, I have to create a PSRAM list out of this rectangle list. I have to either fit all these computations into vblank time, which is something like 20 lines, 600 microseconds, or make 2 lists, preparing the second while the first is displaying.
The ping pong list generation with 2 lists could be simpler to manage although it adds a frame of latency to what gets displayed on screen. Having to do everything in a 600 microsecond budget might not be that easy unless the window count remains small, particularly if every scan line a window spans has to be computed independently, instead of reusing (adjusted) prior scan line list values.
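A minimal sketch of the two-list (ping pong) idea in Python, only to show where the extra frame of latency comes from. The class and method names are hypothetical; nothing here is taken from the actual driver or window manager.

```python
class PingPongLists:
    """Two request lists: one being displayed, one being prepared."""

    def __init__(self):
        self._lists = [[], []]
        self._active = 0        # index of the list the video side reads
        self._pending = False   # True once the standby list is complete

    @property
    def active(self):
        return self._lists[self._active]

    def publish(self, new_list):
        """Store a freshly computed list in the standby slot.

        This may take as long as it needs; the active list stays untouched.
        """
        self._lists[1 - self._active] = list(new_list)
        self._pending = True

    def on_vblank(self):
        """Swap lists during vertical blanking if a new one is ready."""
        if self._pending:
            self._active = 1 - self._active
            self._pending = False
        return self.active
```

A list published in the middle of a frame only becomes visible after the next on_vblank() call, which is the added frame of latency.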
I imagine a lot of the time the new scan line list entry addresses would just be a simple offset from the prior scan line, and the request list window source order and pixel lengths in the list could remain the same unless a window has just finished being displayed on the prior scan line or a new one has started on this scan line. So maybe it won't be too slow in the end with a small group of windows displayed. It will be interesting to see your results with this effort. Probably your actual limit in the end may not be 8 windows total on screen, but up to 8 per scan line or something like that. For a static setup this is useful but if you want to let them move around dynamically controlled by a user, a per scan line limit could be harder to enforce vs a per screen limit.
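To illustrate the "simple offset from the prior scan line" idea, here is a small Python sketch. The ListEntry fields and the per-window stride table are assumptions made for the example, not the real request list format, and the shortcut only holds on lines where no window starts or ends.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ListEntry:
    window: int      # index of the source window (hypothetical field)
    psram_addr: int  # where this segment's pixels start in PSRAM
    length: int      # bytes to read for this segment


def advance_line(prev_entries, stride_by_window):
    """Derive the next scan line's entries from the previous line's.

    Order and lengths stay the same; each address moves down one row of
    its own window. Only valid if the set of visible windows is unchanged.
    """
    return [
        replace(e, psram_addr=e.psram_addr + stride_by_window[e.window])
        for e in prev_entries
    ]
```

On lines where a window begins or ends, the list for that line would have to be rebuilt rather than offset.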
The list has to be computed only if someone moves, closes or opens a window. When nothing is moving, it is static. The problem may be where to keep the list, as it can grow big... or I can always place this list in PSRAM as well and preload it in hblank.
8 windows in a scan line and 8 windows total are the same thing: if I allow more windows and someone moves them, there can be more windows in one scan line. 7 windows + background in one scan line can give 15 positions in the list, and that's why I used 15 for testing.
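The 15-entry worst case can be checked with a small Python sketch that splits one scan line into segments by topmost window. This is an illustration under my own assumptions (windows given back to front, -1 meaning background), not the rectangle maker described in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    x: int
    y: int
    w: int
    h: int


def line_segments(windows, y, screen_width):
    """Return [(owner, x_start, length)] for scan line y.

    windows are listed back to front; owner -1 is the background.
    """
    active = [(i, win) for i, win in enumerate(windows)
              if win.y <= y < win.y + win.h]
    # Every visible window edge is a possible segment boundary.
    edges = {0, screen_width}
    for _, win in active:
        edges.add(max(0, min(screen_width, win.x)))
        edges.add(max(0, min(screen_width, win.x + win.w)))
    xs = sorted(edges)
    segments = []
    for left, right in zip(xs, xs[1:]):
        owner = -1
        for i, win in active:  # later windows are in front, so they win
            if win.x <= left and right <= win.x + win.w:
                owner = i
        if segments and segments[-1][0] == owner:
            prev_owner, start, length = segments[-1]
            segments[-1] = (prev_owner, start, length + right - left)
        else:
            segments.append((owner, left, right - left))
    return segments


if __name__ == "__main__":
    # Seven separate windows on a 1024 pixel line, with background between.
    wins = [Window(50 + 100 * k, 0, 50, 10) for k in range(7)]
    print(len(line_segments(wins, 5, 1024)))
```

Seven separate windows give 7 window segments plus 8 background gaps; seven nested windows give the same count, since each one shows a strip on both sides of the window in front of it.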
I am now implementing this rectangle maker in Basic, so when it is ready I can measure how many clocks it needs. Time critical parts can then be rewritten in asm, if needed.
Yes this is good - a static list doesn't need recomputation each frame. So then it doesn't really matter too much how long it takes to recompute (if you use active/standby lists) except when you move a window around with a mouse for example and want to update the screen at 50 or 60Hz. Maybe the rendered window being moved can be treated as special and gets output last in the list to simplify this moving window case (and you only put it in its final place once it stops moving). I'm assuming that the displayed scan line gets buffered in HUB RAM anyway for your video driver so the actual order read from PSRAM isn't that important and the video driver outputs the previous scan line while you compose the next one - or is it actually overlapped more than this?
There is a 4-line buffer. While a line is displaying, line+2 is preloaded from PSRAM, and sprites are drawn on line+1.
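One way to picture that pipeline, as a Python sketch. It assumes the four buffers are used as a ring indexed by line number modulo 4, which the post does not actually state; the names are made up for the example.

```python
LINES_IN_BUFFER = 4


def slots_for(display_line):
    """Return which buffer slot each pipeline stage uses right now.

    Assumption: the slot for a scan line is simply line % 4.
    """
    return {
        "display": display_line % LINES_IN_BUFFER,
        "sprites": (display_line + 1) % LINES_IN_BUFFER,
        "psram_load": (display_line + 2) % LINES_IN_BUFFER,
    }


if __name__ == "__main__":
    for line in range(6):
        print(line, slots_for(line))
```

Under that assumption the three stages never share a slot and one slot is always idle, which leaves some slack if a PSRAM read runs late.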
That's good having that extra buffer, then any currently moving window could always be handled separately at the start or end of the request list and its original position in the prior list could simply be bypassed/hidden so you can reveal more of the adjoining windows underneath/beside it.
As long as the required bandwidth is there, you don't necessarily have to read the various window segments from PSRAM on a given scan line from left to right in output display order. So you can basically temporarily hide/remove the original window position and have a moving window list item added on each affected scan line which reads in the needed portion at the appropriate HUB RAM position in the scan line buffer. Adjoining windows simply reveal any previously undrawn portion beside/under the moving window. That update step would need a per scan line adjustment though, and would ideally complete once per frame for rapid feedback. You'll need some sort of Z-order management too for all of this if you want to allow overlapping rectangular windows.
You can also use my driver's inbuilt graphics copy routines to populate the different window frame buffers with standard GUI controls like push buttons and check box images etc. These GUI controls could also stay in PSRAM. A decent GUI should be reasonably straightforward to achieve and be highly responsive once you have your window logic framework all sorted out. One trick will be changing the window size and opening/closing windows dynamically under user control. Once you allow that, you'll likely want a heap or some other type of memory management of PSRAM.
To be tested
Meanwhile I attached 2 heatsinks to the Edge, one directly on a P2, the second one on the back of the board, under the P2. They didn't stabilize it at 354, and it was stable at 336 without them, but the board is now much cooler, so the overall stability should be better.
Ada's experience suggests that deregistering the clock out pin to the PSRAM might help ... Or if that's the default then registering the data pins instead.
Yeah you may wish to experiment with PSRAM input timing further as I still haven't released any timing updates there for the P2-EC32MB and the default I used may not necessarily match your own board at the higher rates/temperatures. If you run the bundled psram_delay_test.spin2 utility from my drivers you can see the different delay breakpoints over frequency and you can then use this information to adjust the input delay timing for your particular operating frequency (and temperature) so things remain centered in the bands with as few errors as possible due to setup timing margins being exceeded or other jitter corrupting the sampled data.
Here are the testing results. Tested from 250 MHz to crash at 374.
Ok so the way you use this is that if you wanted to operate at 354MHz you can see there are three delay values that work without error (100% success in the columns for delays 11, 12, 13). You would probably pick the middle one, 12, as the one with the most margin. The current driver default value is also shown in parentheses and is currently set to 13, which has been good above 350MHz, but there's not a lot of margin from 350 to 354MHz. I'd use a delay value of 12 in this case, and not use the driver default.
The crash will be due to hubRAM data errors. Those mass failures above 366 MHz are due to hubRAM read/write contention and are temperature sensitive. HubRAM internal timing, when under contention, seems to give up just below where the PLL self-limiting kicks in. And that point will move down in frequency as the Prop2 warms up.
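The "pick the middle of the error-free band" rule can be written down as a short Python sketch. The function name is hypothetical, and in the example only the 100% results for delays 11, 12 and 13 come from this thread; the percentages for 10 and 14 are invented to give the band some edges.

```python
def pick_delay(success_by_delay):
    """Pick the middle of the longest run of consecutive error-free delays.

    success_by_delay maps a delay value to its success percentage.
    For an even-length run the lower of the two middle values is returned.
    """
    good = sorted(d for d, pct in success_by_delay.items() if pct == 100)
    if not good:
        raise ValueError("no error-free delay at this frequency")
    best_run, run = [], []
    for d in good:
        if run and d == run[-1] + 1:
            run.append(d)   # extends the current band
        else:
            run = [d]       # gap found, start a new band
        if len(run) > len(best_run):
            best_run = list(run)
    return best_run[(len(best_run) - 1) // 2]


if __name__ == "__main__":
    # 11..13 at 100% as reported for 354 MHz; 10 and 14 are made-up values.
    print(pick_delay({10: 80, 11: 100, 12: 100, 13: 100, 14: 60}))
```

The result would then be passed to setDelay() instead of relying on the default profile.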
It makes sense that deregistering the clock would have the same effect as -1 to the compensation delay. EDIT: Or is that what -1 actually does do?
A new test bench is prepared. I have to recompile the kbm interface for a Pi Zero 2, which I used here hoping to reach a higher UART speed than the RPi Zero v1's 1.92 Mbps. This can enable more options than kbm only, for example a wifi interface for the P2 or a remote mp3 decoder.
How about SPI if you wanted a higher speed from the RasPi, instead of a UART? Ultibo surely should support that and looks like you have plenty of pins for it. 1.92Mbps is not very fast once you mix in those other uses you mentioned beyond a simple keyboard and mouse.
RPi 3 or 02, which is the same chip, can do UART (or SPI, or several other things) at 250 Mbps. Of course these bits will not fit in the wire between RPi and P2, but 10 Mbps... may be possible. The main limiting factor is this wire and not the hardware.
There is of course SPI, but there is also i2s available for audio signals. i2s is also capable of several MBps and I already have the code for it. Both are synchronous, so one more wire and delay stuff... UART is simpler.
The ultimate solution is SMI, but that is something I have no experience with at all.
Now I have to configure a new Ultibo on this 02 and recompile the interface. I haven't done anything with Ultibo for several months; there is a new version and something doesn't work as I expected, so now I have to learn what has to be changed after the major upgrade in Ultibo.
Next comes hubRAM read failures when not in contention. And finally, when running at the PLL limit, even the writes fail.
Cogs are reliable at the PLL limit, as long as they're not touching hubRAM. I/O and smartpins are fine. I haven't tested the Cordic.
Perhaps you can find a solution for those wire-speed limitations in differential LVDS signaling.
TI's DS90LV027 (dual LVCMOS-to-LVDS driver) and DS90LV028 (dual LVDS-to-LVCMOS receiver) are a relatively cheap and easy-to-use solution, requiring a 100 Ohm impedance twisted pair as conductors for each differential lane, and a single 100 Ohm terminating resistor per lane at the receiver side (close to the DS90LV028s).
They come in 8-pin SOIC packages with a layout-friendly flow-through pinout, and the power supply is 3.3 V.
A bit of caution with current consumption and proper bypassing capacitors will keep them happy at the frequencies at which you'll be able to use them.
Maximum spec'd signaling rates can be seen as a bit "scorching": >600 Mbps for the drivers and >400 Mbps for the receivers, so you can expect asynchronous UART serial bit rates up to P2's Sysclk / 5 to work without much trouble. They can be pushed up to Sysclk / 3, but that can be very sensitive to power-bypass and layout conditions.
If the RPis can also handle synchronous serial, maybe bit rates of P2's Sysclk / 4, and even Sysclk / 3, can be reached too.
Hope it helps "frying" some bits, at least.
Henrique
P.S. forgot the links to the datasheets...
https://ti.com/lit/ds/symlink/ds90lv027aq-q1.pdf
https://ti.com/lit/ds/symlink/ds90lv028aq-q1.pdf
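For scale, here is what those Sysclk dividers work out to at the 336 MHz clock reported as stable earlier in the thread. This is plain arithmetic in Python; the function name is made up and the clock value is just the one from the thread.

```python
def bit_rate_mbps(sysclk_hz, divider):
    """Serial bit rate in Mbps for one bit every `divider` system clocks."""
    return sysclk_hz / divider / 1_000_000


if __name__ == "__main__":
    sysclk = 336_000_000  # P2 clock in Hz, taken from the thread
    for divider in (5, 4, 3):
        print(f"Sysclk / {divider} = {bit_rate_mbps(sysclk, divider):.1f} Mbps")
```

All three rates are well below the >400 Mbps quoted for the receivers, so the limit would come from the P2 and RPi side and the wiring, as described above.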
I still have to connect it via RPi GPIO pins, so it seems to be too much work for too little gain.
In the test bench I made, these wires are about 5 cm (2 inches) long, but to test this I still have to have a new Ultibo running and tested. It is installed now.
Ultibo went from 2.1 to 2.5 and the kbm interface no longer works, as my hacked modules are not compatible with the new system. Now I have to learn how to make them run again.
... and this ended with a bug report: the dedicated RPi keyboard doesn't work with their example code on the Zero 2 (it works with the RPi 3). So I connected the old Zero to the EC32 - it works too. Maybe the Zero 2 is still too new. This whole interface was made to enable using this small and convenient RPi keyboard, which is invisible to the P2 USB driver due to its internal hub.
I am now testing the player at 354 MHz and it hasn't crashed yet, although it is cog heavy (up to 7 at once) and the PSRAM also has a lot of work to do providing audio and video. There is something critical in this simple video test which causes it to crash at 354 MHz...
Room temperature will likely be a factor. Cover it with some warm clothing. It shouldn't take long to crash as it warms up.
PS: I have a thermocouple soldered on bottom-middle of my Eval Board for measuring this. One glued to bottom heat-sink should do the job.
EDIT: Huh, the rules change at higher temperatures - HubRAM writes fail before reads!
EDIT2: 355 MHz topples at around 80 °C die temp. Thermal gradient would need to be calculated for your case. You'd need to start with the 1.8 V supply current measurement.
The board is now without any case and it has 2 small (P2 sized) heatsinks, one on each side of the board. These heatsinks and the board itself are at about 40C when running the player or the video test. The video test uses 5 cogs: the main cog, the video cog, the PSRAM cog, a sprite moving cog (Basic procedure in a closed loop) and a cog which does graphics in the small window. The player uses up to 7 cogs (7 when playing SIDs: main, audio, video, psram, sidcog, 6502, playing loop). Maybe these 7 cogs are less busy than those 5, so they produce less heat. The player worked for more than 2 hours without even a glitch, while the video test crashes within several seconds of starting. Tomorrow I will check the test program again. Today I will not be able to play with a P2.
In general I expect heavier PSRAM activity will need more power - cogs do more, hubRAM does more, and pins do more.