Crazy yep. Be good to see a 55 inch vertical on the wall of a home! LOL.
Here's a quick zip file demo of the hires portrait thing in case you want to see for yourself and you have a 1920 pixel wide portrait capable VGA monitor (or just tilt your head!).
It needs the P2-EVAL REVB with a VGA breakout on pins 8-15.
Warning: The P2 will be clocked at 308MHz and 297MHz for 1920x1200 and 1920x1080 respectively.
Two versions are included, one at each resolution listed above. I sort of faked smooth scrolling a little as you will see, but it could be done properly I think.
Use loadp2 or flexgui to load the binary into the P2 board.
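For example, from a command prompt something like this should do it (the file name here is made up, use whichever binary is in the zip): loadp2 p2_portrait_1080.binary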
> @Cluso99 said:
> Chip,
> Your 55" vertical monitors are 4K aren't they?
>
> So say 3840x2160 with 8x8 font gives 480x270 and your monitors are vertical, so 480 lines of 270 characters per monitor !!!
They are 4K, but you can only signal above 1920x1200 by using HDMI, unfortunately.
At 4K, you could also get TWO columns of 135 characters by 480 lines, for a total of 960 lines!
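(For reference, the arithmetic: in portrait the panel is 2160 pixels wide by 3840 tall, so an 8x8 font gives 2160/8 = 270 characters per line and 3840/8 = 480 lines; splitting that into two side-by-side columns of 135 characters doubles the line count to 960.)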
Yes, I am spoilt too. I have 3 x 24" 1920x1080 monitors on my home PC and the same at work, although there I also have a second PC, so one monitor occasionally gets switched (HDMI/VGA in the monitor's menu) to that second PC, which mainly runs batch Python jobs. I just love my 3 monitors, although I am jealous of that pair of 55" 4K ones of yours!
Today I managed to hack my HyperRAM arbiter code to split the video transfers into sub-4us sized portions on burst reads. So far a video frame buffer using it looks good on the screen with 640x480 VGA at 16bpp, but I am still testing to see whether the corruption I saw previously comes back. I don't think it should now with this change.
Right now, looking at a scope, the total read overhead per burst seems to be about 0.78us of the overall 3.75us that !CS is low when running the P2 at 200MHz (100MHz HyperRAM), and I'd like to improve it a little more if I can. The address setup phase is not yet using the streamer, so that can be sped up (right now it is byte banged). It currently takes about 450ns for that portion before data is returned, and I lose the remainder of the 0.78us in each sub-burst iteration comparing and adjusting addresses and other software loop logic. Despite that, the overall efficiency of these burst transfers at this HyperRAM clock rate is still quite good at about 83%, though it could be improved.
This was using sysclk/1 transfers and I guess that is not always going to be reliable (board layout + RAM & P2 temp/process dependencies etc), but sysclk/1 should stress it more.
Here are some scope pics I captured showing the above. The yellow trace is my HyperRAM !CS signal and the blue trace is the HyperRAM clock. Because my scope's bandwidth is poor, the clock at high frequencies looks as if it tristates at an intermediate voltage; it doesn't really, but that region conveniently indicates when the high-speed streamer transfer is happening.
The first picture shows two sequential video scan line external memory transfers about 32us apart with two sub bursts each. 640 bytes were being transferred per sub burst in this 640x480 16bpp mode.
The second picture shows the sub burst transfers for a single scan line in more detail and you can start to see the overhead portion where the clock signal is not sitting at that intermediate level.
The third picture shows the end of one sub burst and the beginning of another, with the overhead in more detail. In this portion we need to complete the end of the sub-burst transfer, stop the clock and raise !CS high. Then we need to do some PASM loop housekeeping and restart the second portion by dropping !CS and setting up the address phase of the next sub-burst before beginning the transfer again after the final clock pause. Total overhead is about 780ns per 640 byte sub-burst transfer at this clock speed.
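(Breaking that ~780ns down using the figures above: roughly 450ns is the byte-banged address phase, leaving about 330ns for the !CS handling and PASM loop housekeeping.)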
I should be able to tighten up the portion where the clock is being byte banged relatively slowly (showing up as a sine wave in the picture above) if I use the streamer for this.
Rogloh, that is great! In the first picture, I understand you are getting two sets of 640*2 bytes. It looks like you are only using 25% of the available bandwidth. Is that so? That is awesome!
Yes that is right. There is quite a bit of extra bandwidth left over with VGA @16bpp resolution and sysclk/1 operation, and you can achieve 24bpp too. I think the P2 at 200MHz is a sweet spot for analog VGA (8x pixel clk) with HyperRAM then getting its maximum rated 200MB/s at sysclk/1 operation. Or you could tweak the PLL output closer to 201.4MHz if you needed a purer 25.175MHz dot clock. Usually it won't matter and 25MHz can do for a dot clock.
DVI with its 10x clock requirement means the HyperRAM bus really needs to run around 126MHz (or 63MHz) when the P2 is running at 252MHz for the 640x480 resolution. With my current settings I've found 126MHz to be a bit too high (26% overclock) and 63MHz drops things down to 126MB/s maximum HyperRAM transfer speed, which is still enough for this resolution with 16bpp.
There are a few other sweet spots. 200MHz is also nice for SVGA as well (5x pixel clock), then 195MHz for XGA (3x), and 216MHz for SXGA (8% memory overclock). These keep the HyperRAM bus clock operating at a high value while providing a P2 clock at a nice integer multiple of the standard pixel clock for these resolutions.
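For reference, these operating points just come from integer multiples of the standard pixel clocks (my arithmetic, not anything taken from the driver itself):
VGA 640x480@60: 25.175MHz x 8 = ~201.4MHz (or 25MHz x 8 = 200MHz)
SVGA 800x600@60: 40MHz x 5 = 200MHz
XGA 1024x768@60: 65MHz x 3 = 195MHz
SXGA 1280x1024@60: 108MHz x 2 = 216MHz, i.e. a 108MHz HyperRAM clock with sysclk/1 transfers, hence the 8% memory overclock.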
Where sysclk/1 operation is not consistently reliable and we need to drop to sysclk/2, these operating clock rates need to be reconsidered to try to maximize the P2 clock / 4 value, keeping it as close to 100MHz as possible if you need the highest performance and wish to provide the most remaining shared external memory bandwidth to other COGs.
Hmm, at 480p, how many bytes can be fetched per line? Enough to composite a couple of big (like, 256 pixel wide) 32bpp RGBA bitmaps with rendered graphics?
@Wuerfel_21 In my HyperRAM arbiter driver at least, the left over bandwidth not used by the video driver's requirement is allocated to other COG requests. The amount of write data bandwidth available per scanline to a non-video COG is a function of the time remaining after the video request completes until its next request, the time it takes to round-robin poll all the COGs for a request, the burst size and the maximum limit I impose to try to keep the latency down for the video request. It is also affected by the competition with other non-video COG requests if there are more than one of these COGs.
I currently restrict a non-video COG to a 256 byte burst, and it takes about 60 P2 clocks plus egg-beater latency per polling iteration to check all 8 COGs. The current burst transfer overhead, as mentioned above, is about 0.78us, so say roughly 1us of overhead per COG write burst request once you include the polling. With these values you might be able to estimate how much time there is for your transfer. It's a reasonable amount in the best case. I think we are probably talking something in the ballpark of 4-8 large COG write burst opportunities per scanline with 480p @ 32bpp, so up to ~63MB/s if sysclk/1 operation is used. It's even higher for 8 or 16bpp.
EDIT: The above incorrectly assumed sysclk/1 writes. Right now we do not have that. Writes are done at sysclk/2, so divide the values by 2. Only reads work (in some cases) with sysclk/1, and sysclk/2 is probably more realistic.
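A rough sanity check of that estimate (my arithmetic, not measured): a 640x480 scanline is about 31.8us, so 8 bursts of 256 bytes comes to 2048 bytes per line, or roughly 64MB/s; with writes at sysclk/2 that halves to around 32MB/s.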
If you can manage sysclk/1 read operation on your board and need truecolour 24bpp colour depth, it looks like SVGA at 200MHz is a nice spot for the P2 to operate with a single 16 bit HyperRAM. On the scope it looks like it leaves around 9.5MB/s of write bandwidth per scanline for other COGs' accesses, more if you include the extra capacity from the vertical blanking lines. Of course, this value is lowered if you are doing read/modify/write on the HyperRAM contents, because you need to do the read as well (though that will also operate at sysclk/1 if the video does).
Yellow is the HyperRAM chip select and two SVGA video scanlines are shown with 5x640 byte sub-burst transfers per scanline = 3200 bytes, being 800x32 bit pixels.
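(Worked back from those numbers, assuming the standard ~37.9kHz SVGA line rate: one spare 256 byte write burst per ~26.4us scanline is about 256/26.4us = 9.7MB/s, which lines up with the ~9.5MB/s seen here.)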
UPDATE:
Some real world numbers....
Just timed the HyperRAM writes with SVGA@24bpp in a tight SPIN2 loop with the P2 at 200MHz. With this test code writing 100MB:
'Benchmarking COG:
  w := cnt                                  ' capture start time
  repeat 1024
    ram.writeburst(ram#RAM, 0, z, 10240*10) ' write 100kB
    z += 2
  w := cnt - w                              ' compute elapsed time
  vid.dec(w)                                ' print result
...
'HyperRAM driver:
PUB writeburst(bank, addr, hubaddr, totalLen) | len, burst
  if totalLen =< 0
    return 0
  len := totalLen
  repeat while len > 0
    if len > BURSTLIMIT                     ' 256 bytes max (for now)
      burst := BURSTLIMIT
    else
      burst := len
    mailbox[cogid+8] := burst << 24 + (hubaddr & $fffff)
    mailbox[cogid] := REQ_WRITEBURST + (bank & $f) << 24 + (addr & $ffffff)
    repeat until mailbox[cogid] => 0
    hubaddr += burst
    addr += burst
    len -= burst
  return totalLen
I get a value of 864716978 clocks, which corresponds to 4.32 seconds at 200MHz.
This is 100MB/4.32s = 23.1MB/s of available write bandwidth (here 1MB = 1024*1024 bytes). There is some SPIN2 overhead from the test itself which was found to be 59342834 clocks (0.3 seconds) in each case when the mailbox result loop was skipped.
At 16bpp running the same test I get 527748434 or 2.64s, which is 37.9MB/s including the test overhead.
At 8bpp running the same test I get 443502770 or 2.21s, which is 45.1MB/s including the test overhead.
With this burst limit and test setup it looks like 16bpp is somewhat of a sweet spot: you don't really sacrifice very much write bandwidth over the 8bpp mode. Of course you will have to write twice the graphics data. I found multiple runs give consistent results.
I tested other resolutions as well; here are the results obtained with 100MB of HyperRAM write bursts from a second COG while the video driver COG was actively reading its frame buffer from HyperRAM. For now this is probably about as good as it gets with sysclk/1 reads and the full rated bandwidth on the HyperRAM bus when a FastSpin driver is used to break up the bursts as I did in the code above (using PASM2 instead of SPIN2 would be faster if I broke up the requested bursts inside the HyperRAM driver instead, which I may ultimately do).
I could try increasing the burst size limit a little further, but this can introduce more latency in the video data, which may prevent a mouse pointer being rendered. There's going to be a different optimal burst size limit for each resolution and colour depth.
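As a rough worked number (my arithmetic, assuming sysclk/2 writes at ~100MB/s): a 256 byte low-priority burst holds the bus for about 2.5us plus the ~1us of per-burst overhead, so around 3.5us is the most a video request can be delayed by one in-flight burst; doubling the limit to 512 bytes pushes that worst case toward ~6us.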
Update: Separately to the above I tested 1280x1024 (P2 @ 216MHz) and got 31MB/s of write bandwidth at 8bpp. I couldn't get any remaining burst bandwidth in 16bpp for any writes (it's flat out just with the video), and at 24bpp it was not possible to get a proper frame from HyperRAM (insufficient total bandwidth).
I thought this HyperRAM arbiter code is running in one COG, with the video COG in another. If that's the case you really only need to poll six other COGs for non-video requests. Reducing this number to four might be reasonable if you are using a mailbox scheme.
I don't know whether you've considered the use of COGATN for flagging non-video requests. This could reduce the polling time, although the servicing time would likely be unaffected, and some sort of priority scheme or COG service history scheme may be required to prevent starvation of individual non-video COGs.
It's true that reducing to 6 round robin COGs could possibly save 8 clocks inside the polling loop, with some additional complexity to manage it, especially if things change dynamically, but given the overall overhead time for a single burst read/write request is in the vicinity of 242 total clocks, saving those 8 clocks won't change much.
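(For scale: 242 clocks at a 200MHz sysclk is about 1.2us, in the same ballpark as the "roughly 1us including polling" figure mentioned earlier.)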
COGATN is another alternative. I've not investigated it yet; I'm just trying to get something simple/generic going by polling hub memory. That way any P2 language that can read/write hub memory can make requests.
Here's the current inner polling loop:
reloadmailbox   setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
prioritycog     tjs     req0-0, cog0jmp-0       'check highest priority COG (eg. video COG)
                rep     @.rrloop, #8            'round robin poll COGs
                alti    testrequest, alti_sub   'increment through all 8 COGs
                tjs     req0-0, cog0jmp-0       'if service requested, jump to cog handler
.rrloop
                jmp     #reloadmailbox          'retry if no requests

testrequest     tjs     req0, cog0jmp
Not shown above, but at the end of servicing a real request I also cycle the first COG polled in this round robin loop to introduce fairness and prevent the same COG from getting polled first every time. It's just one instruction (alti) used to cycle it.
EDIT: Actually that fairness cycling operation adds a lot of complexity if alternatives to the above code are employed, and it may burn up the 8 saved cycles. It might be easier to keep this polling code unless COGATN from various COGs is a better way to do it. I also wondered if SKIPF could be used with a rotating skip pattern for sequence priority. Maybe with only a 6 COG polling sequence and a 7 skip distance maximum that could work if the jumps cancel the SKIPF pattern.
This is what I was thinking may work with SKIPF, assuming a jump cancels the skipf (does it?). It may save ~10 cycles in the loop if it works, but it adds 6 more clocks to the service request, so the benefit is lessened; still, it may be an improvement for reducing service latency:
cyclesequence   shl     pattern, #1             'code called after a valid service
                or      pattern, #2
                testb   pattern, #13 wz         'test for completion of cycle in pattern
if_nz           mov     pattern, ##%111111_000000_0
reloadmailbox   setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
                skipf   pattern
prioritycog     tjs     req0-0, cog0jmp-0       'check highest priority COG first (eg. video COG)
                tjs     req1-0, cog1jmp-0       'these get patched with 6 non-video and non-HyperRAM COGs
                tjs     req2-0, cog2jmp-0
                tjs     req3-0, cog3jmp-0
                tjs     req4-0, cog4jmp-0
                tjs     req5-0, cog5jmp-0
                tjs     req6-0, cog6jmp-0
                tjs     req1-0, cog1jmp-0       'repeated so the rotating skip pattern can wrap the poll order
                tjs     req2-0, cog2jmp-0
                tjs     req3-0, cog3jmp-0
                tjs     req4-0, cog4jmp-0
                tjs     req5-0, cog5jmp-0
                tjs     req6-0, cog6jmp-0
                jmp     #reloadmailbox          'repeat cycle

pattern         long    0                       'zero value forces re-init the first time it is used
> This is what I was thinking may work with SKIPF, assuming a jump cancels the skipf (does it?).
No, remaining skip bits will still be in play after TJS.
A jump that ends a skip pattern could be done by using EXECF with skip bits zeroed, e.g. with 9-bit immediate if jumping to cog RAM. However, EXECF cannot do a test as well.
Ok thanks TonyB_, that kills the SKIPF idea. My current way with the REP and ALTI with TJS to cycle through the COGs seems reasonably fast. I think I'll keep it for now.
When I counted up the clock cycles in the current HyperRAM polling loop I found it was 66-73, depending on egg-beater latency. Given that the same mailbox address is being read every time, I guessed it might be better to tighten the loop down to 64 clocks (a multiple of 8), which I can do by just unrolling the REP loop. Interestingly, doing this consistently gave slightly slower performance than the REP loop: compared to the test results in the table above, values decreased by ~1-2MB/s. I'm not sure what is causing this.
Also been thinking about the COGATN instruction. I'm not entirely sure if/where it will be of benefit here, given the COG sourcing the COGATN indication is unknown to the receiving COG when the receiver is the HyperRAM driver, and in that case we still need to read all the mailboxes anyway to determine it. I'm still deciding...
In the reverse direction, I guess we could use COGATN to indicate that the RAM transaction is complete so the requesting COG doesn't have to keep polling for this result, however this may not save a lot. It's easy to do, but not all COG requestors may want this, especially if they use COGATN for other purposes, so it would have to be indicated. I have an out of band control channel that could do that.
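For what it's worth, a minimal sketch of that completion-notify idea (the atnmask register name is made up, and it assumes the driver can derive the requestor's COG ID from the mailbox slot it just serviced):
'in the HyperRAM driver, after the result long is written back to the requestor's mailbox:
                cogatn  atnmask                 'atnmask = 1 << requestor COG ID
'in the requesting COG, instead of spinning on its mailbox long:
                waitatn                         'block until the driver strobes attention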
Perhaps the video COG could use COGATN exclusively on its requests and the HyperRAM COG could know that means a high-priority request has come in without checking first. But given that the burst from any low priority COG needs to complete before the high-priority COG gets the next go, again I'm not sure how much benefit this small latency reduction really is going to be.
@rogloh If I bump the system clock up to 320MHz, the video output should be fine, right? Your driver uses the smart pins to generate output, so as long as I meet the *minimum* clock freq, it should be fine?
The reason for asking is that I routinely run at 320MHz. The chip is bulletproof there, and some of the code running on other cogs could really use the extra speed.
320MHz driver operation should work ok. I've been using it even higher, up to 350MHz or so, with some HDTV modes. It'll probably do even more, but I wasn't game to try in case of overheating etc and I didn't have a fan. Other people have pushed the P2 even higher. It's up to you to see how far you want to push it...
Update: You will need to adjust your video timings to suit the higher clock speed, otherwise the HSYNC/VSYNC refresh rates will be scaled from their defaults (which may still work depending on your monitor's input range). I'm assuming you are using the analog VGA output.
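(As a rough example of that scaling: timings configured for a 200MHz sysclk but run at 320MHz come out 320/200 = 1.6x fast, so a nominal 60Hz mode would refresh at roughly 96Hz if left unchanged.)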
Yes, in hindsight it would have been good if the receiving COG received a 16 bit field to indicate which COG(s) were signalling it.
I had not taken into account that the video COG needs to update the pointers on region changes, so it's not sufficient to just trigger the next transfer for the video COG.
As the arbiter COG doesn't need to be checked for requests, you could trim the table by two longs and only need to test seven entries. Not a big saving but all the same probably worth doing.
@AJL, The problem is that I am making good use of the 8-register wrap-around feature, with ALTI incrementing the D and S fields in a single instruction, so polling 7 entries instead is problematic if there are gaps in the sequence being skipped. I could unroll the loop, but then I would need to update 7 entries with new values after every service request is processed to implement the round-robin poller with fairness, such that the first COG polled after the priority video COG (when it doesn't need service) is cycled each time. If you don't do this cycling, it actually prioritizes the non-video COGs.
Here is the alti constant I am using in the code above.
alti_sub        long    %000_110_110_000_111_111        'increment D,S on 8 long boundary & substitute
Rogloh, might it help to read in your mailboxes using SETQ+RDLONG, then write out your effects using SETQ+WMLONG?
I would also straight-line the polling code to get rid of ALTx instructions. This HyperRAM server exists once in the chip and everything else depends on its performance. Larger code is warranted.
I've been thinking lately that after the chip circuitry is optimized to execute atomic instructions as fast as possible, additional gains can only come from hardware macro functions, like SETQ+RDLONG, or by using the smartpins and streamer together. Figuring out how to leverage those things gives big improvements.
Yeah, given the steps that need to happen, there are still a few places left for optimisations:
1) unrolling the main request polling loops - this needs to be replicated 6 or 7 times, excluding the HyperRAM COG itself from the sequence. Potentially it could be configured to poll fewer than 7 COGs if this is known at setup time. A future config option perhaps.
2) inlining the address setup phase subroutine into its callers' code paths - this may save 8 cycles on CALL/RETs; needs 3 replications.
3) trying to use the streamer/smartpins for the address setup phase where possible. This is slightly complicated as the data to be sent is 48 bits and has to come from an immediate register, possibly with a special delay in between for clock control... TBD.
I'm making handy use of SETQ+RDLONG for mailbox polling in the current code. For writing back results to hub I think SETQ + WMLONG may not help in this case because each individual request only needs a single long written back to HUB (the data results get sent back by the streamer).
Yesterday I tried to divide up the low-priority COG requests so that their larger burst transfers, which could otherwise starve out a video COG, could still be fully handled inside the HyperRAM COG instead of needing to be broken up first by the caller (as I did in the benchmark SPIN2 example above). I'm still playing with this idea, but I observed that in higher video load situations the gains from that change were possibly being lost to the extra instructions needed to save/restore the state, so I may need to rethink it. There may still be things I can do to improve this, and it may be sensitive to the burst size chosen and how it divides into the remaining scan line time budget. I may need to separate the read code paths: one for video reads, the other for non-video reads.
I wonder if we can make a single general purpose HyperRAM driver to cover everything. There are different applications here and I can envisage multiple common usage cases...
(1) a (single) video COG + one or more reader/writer COG requestors using it
(2) a single HyperRAM requestor COG (non-video) which gets exclusive access
(3) multiple HyperRAM requestor COGs (non-video)
Each case has different restrictions and could be optimised differently. Case (1) is obviously what I am playing with right now. In comparison, case (2) is rather simple. Case (3) might be a variant of case (1) without a high priority COG nominated, or it could possibly go further and give different COGs different weights. Don't even ask about doing case (1) with multiple video COGs... arggh!
Maybe different driver variants should ultimately be developed for each case, or a common driver can be configured/modified at init time in different ways... we'll have to see where things go.
Here's an unrolled polling structure I might try next. It has some key benefits:
1) provides the round-robin poll order for request fairness (but not bandwidth fairness)
2) eliminates the per-loop REP and ALTI overhead and the extra 4 cycle JMP at the end of the polling loop... the loop is now 40 clocks?
3) allows for less than 8 COGs to be polled by just setting the pollcount value
Downsides:
1) adds some setup complexity to configure (one time cost),
2) If the priority COG changes dynamically after initialisation this is more complex to manage as multiple locations in the code need patching (pri1, pri2, pri3 etc).
3) The low priority COG polling code also needs replication and cycling over multiple polling loops because up to 7 copies need updating. Some block COG RAM copies could help here in REP loops once the first sequence is created.
If this works I think it is worth doing.
poll1           mov     poller, #poll2          'contains where to return to after next service
                rep     pollcount, #0           'pollcount = 2 + number of COGs to poll
                setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
pri1            tjs     req0-0, priority_jmp    'priority COG checked first
                tjs     req1, cog1_handler      'then cog check order 1,2,3,4,5,6
                tjs     req2, cog2_handler
                tjs     req3, cog3_handler
                tjs     req4, cog4_handler
                tjs     req5, cog5_handler
                tjs     req6, cog6_handler

poll2           mov     poller, #poll3          'poller contains where to jump next after service
                rep     pollcount, #0           'pollcount = 2 + number of COGs to poll
                setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
pri2            tjs     req0-0, priority_jmp    'priority COG checked first
                tjs     req2, cog2_handler      'then cog check order 2,3,4,5,6,1
                tjs     req3, cog3_handler
                tjs     req4, cog4_handler
                tjs     req5, cog5_handler
                tjs     req6, cog6_handler
                tjs     req1, cog1_handler

poll3           mov     poller, #poll4          'poller contains where to jump next after service
                rep     pollcount, #0           'pollcount = 2 + number of COGs to poll
                rep     #9, #0
                setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
pri3            tjs     req0-0, priority_jmp    'priority COG checked first
                tjs     req3, cog3_handler      'then cog check order 3,4,5,6,1,2
                tjs     req4, cog4_handler
                tjs     req5, cog5_handler
                tjs     req6, cog6_handler
                tjs     req1, cog1_handler
                tjs     req2, cog2_handler

poll4           ...
poll5           ...
poll6           ...                             'etc

after service ends:     jmp     poller
Ok, no worries. I guess I have tried various resolutions and dot clocks. That's one of them, for a handy 2x HDTV clock for the P2. Here's a useful list of them I put together if anyone wants one of these frequencies...
@rogloh, Your poll3 seems to have an extra rep in there (rep #9, #0).
To reduce the amount of code space needed, could you have a 13 long block in hub ram with the tjs instructions?
Then read in the necessary section for each service.
I don't know how you are using the PTRx registers, so I've avoided them, although they could make this simpler still.
Something like:
                org
poll            setq    pollcnt                 'setup for reading (number of COGs to poll - 1) longs
                rdlong  poll_loop, poll_tbl
                rep     pollcount, #0           'pollcount = 2 + number of COGs to poll
                setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
pri             tjs     req0-0, priority_jmp    'priority COG checked first
poll_loop       tjs     0-0, 0-0                'then cog check order as loaded for this service loop
                tjs     0-0, 0-0
                tjs     0-0, 0-0
                tjs     0-0, 0-0
                tjs     0-0, 0-0
                tjs     0-0, 0-0

after service ends:
                incmod  poll_idx, pollcnt
                altr    poll_tbl
                add     poll_idx, #@poll_base
                jmp     #poll

                orgh
poll_base       tjs     req1, cog1_handler
                tjs     req2, cog2_handler
                tjs     req3, cog3_handler
                tjs     req4, cog4_handler
                tjs     req5, cog5_handler
                tjs     req6, cog6_handler
                tjs     req1, cog1_handler
                tjs     req2, cog2_handler
                tjs     req3, cog3_handler
                tjs     req4, cog4_handler
                tjs     req5, cog5_handler
The wait loops execute in the same amount of time, but the time between wait loops is longer; smaller memory footprint, and only one long to patch for dynamic priority changes.
Also, adding a new requestor involves patching at most two longs in the table at poll_base, and updating pollcount and pollcnt (to be pollcount-3).
Food for thought.