The usual P1 XMM model (as implemented in GCC and probably similiar in Catalina, but IDK ask @RossH ) ...
No point in asking me - I have no idea how XMM is implemented in GCC. I know Steve Densen did some of the original work on the cache which (I believe) both GCC and Catalina use to support XMM where the RAM itself is too slow to access directly (e.g. SRAM) - so you are probably right that they are basically similar.
I am still trying to come to grips with whether it is worth implementing code execution from XMM on the P2. I keep wavering between doing a trivial "P1" type port - which would be very easy but suffer from the same problems as XMM code execution has on the P1 (i.e. that it is quite slow) - or doing something more sophisticated.
The problem is that I really have no idea yet whether XMM will be widely used on the P2. Even on the P1 (which absolutely needed it to execute large programs) it didn't see very much use outside us hardcore fanatics.
I will probably continue to waver on this for a bit yet. There are just too many other interesting things to do!
If some form of XMM was doable with HUB-exec and caching it could actually be quite a good combination. If the generated code knows about all its jump/calls crossing some "page" boundary (I know it is not true paging) it could possibly return to some page loader code that either jumps over to another cached page elsewhere in HUB or brings the next page in from HyperRAM and a whole bunch of these pages could be retained in the 512kB of HUB. With the largest burst transfers it might only take a few microseconds to bring in around 512-1kB bytes or so (128-256 instructions). For code that doesn't branch everywhere and mostly stays within its working set, performance could still be pretty decent. Smaller sized page transfers could also be used to speed up the loading rate at the expense of more inter-page branching. It could be tuned to see what page sizes work better.
Either HyperRAM or HyperFlash could be used for the program memory with the same amount of performance. Programs could grow as large as 32MB with the Flash on the P2-EVAL breakout. That's massive. I think this is worth some consideration once people get familiar with how the HyperRAM/Flash could be used here.
With video applications you could still share a video frame buffer in external memory with program memory and give the video driver priority over the caching VM loader. Performance can take a hit but it still could work. There's a lot of memory bandwidth to go around.
I wrote a P1 Fast Overlay loader back in 2008? Heater used it in ZiCog. It loads from hub in reverse ie last address first so that it hits every hub slot.
We used overlays back in the 70’s on the minis and I suspect even earlier on the mainframes. It was absolutely necessary in a 5KB core memory model (cog) with a shared (ie hub) of 10KB.
So I was able to find enough COGRAM and LUTRAM to squeeze in the read-modify-write stuff I was talking about recently. I think it should work out now. Unfortunately I had to shuffle around my EXECF sequences for reads quite a bit to make it all fit which is always a risk of breaking other stuff in pretty nasty ways, so I'll need to re-test a lot of this again.
Current situation:
COGRAM use 502 longs
LUTRAM use 512 longs
I might be able to free a few more COGRAM longs by sharing registers even more but that also gets risky and makes the code more fragile in any future changes, especially if you give the same register two different names for more clarity. You think you can freely re-use something but it then creates a side-effect somewhere else, and the more branching and skipf code paths etc, the harder this stuff gets to track.
I've implemented setting of CR0 myself now. In the process of testing out more combinations I just bumped into confirmation of Von's assertion that P16-P31 pins are more evenly routed than the others. Read data testing doesn't seem to be affected but sysclock/1 writes definitely are. This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.
What sort of write timing variation have you found as you change CR0 impedance @evanh? The results posted above are for the lowest impedance 19 ohm drive value for cr0 ($ff1f) by the looks of it. I wonder how much that should EDIT: (actually affect) write timing if this bus is tri-stated? Do we have other values to compare this with?
Roger,
Registering the HR clkpin for data reads definitely gives a higher usable clock speed. And setting the CR0 drive strength to 19 ohms does help a little too. I'm getting over 320 MT/s read speed now. On the down side, the 22 pF capacitor definitely drags it down. A dedicated board layout will help a lot.
What sort of write timing variation have you found as you change CR0 impedance @evanh? The results posted above are for the lowest impedance 19 ohm drive value for cr0 ($ff1f) by the looks of it. I wonder how much that should actual write timing if this bus is tri-stated? Do we have other values to compare this with?
That's with the capacitor in place. It's only a demo of the differences with something sensitive.
EDIT: ie: The difference between a basepin of 16 and 32 is tiny and generally doesn't impact reliability.
This information on keeping routing well matched (and in general short) should be useful for @"Peter Jakacki" and his HyperRAM add on for P2D2. His design uses P32-P39 for the data bus but there is nothing bad about those P2 pins in general, just that on the P2-EVAL board their trace lengths are not quite as evening matched to the header pins vs other ports which may compromise a future sysclk/1 write operation. I still use P32 all the time with sysclk/1 reads.
.... This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.
I think you meant 22pF there ?
Oops, lol, yeah, 22 pF.
Did you try a trimmer for that skew-cap ? Now there are nice error tables, there may be a better C value ?
I was quite happy with the measured result on the scope. The 22 pF brought the slope nicely parallel to the data traces and the roughly 1 ns lag was just what I wanted.
There was clear case of attenuation kicking in though. The board layout has a lot of capacitance I suspect. I don't think any signal improvement can be made without a dedicated snug hyperRAM on the prop2 board. Even then, I worry that adding a capacitor will be a major nerf, so want to have the space to modify the first experiment board.
Sadly I don't have any layout even on the drawing board yet.
This information on keeping routing well matched (and in general short) should be useful for @"Peter Jakacki" and his HyperRAM add on for P2D2. His design uses P32-P39 for the data bus but there is nothing bad about those P2 pins in general, just that on the P2-EVAL board their trace lengths are not quite as evening matched to the header pins vs other ports which may compromise a future sysclk/1 write operation. I still use P32 all the time with sysclk/1 reads.
Correct, it's the board causing it, not the chip.
On the other hand, I still favour keeping P28-P31 and associated VIO away from any connectors. If that VIO is taken out then the prop2 is bricked because the sysclock oscillators won't power up without it.
Just a quick question
Is the hyperram rated for the clock speeds you are using?
The data rates you are achieving top mine in a complete
Different project by a good number
Nope. We're using Parallax's accessory board, which is a 200 MT/s rated part (IS66WVH16M8BLL). Given how much better the HyperFlash has performed for Roger, maybe the faster Hyperbus2 parts will be a notable boost for us.
The HyperFlash is IS26KL256S-DABLI00, also 200 MT/s.
Surac there's a "version 2" of HyperRam coming from Infineon/Cypress, that goes faster and is rated to 400 MBps (=400 MT/s), in both 1v8 and 3v3 variants.
So far, the 1v8 parts of version 2 are available in stock, "S27KS0642*", but the 3v3 "S27KL0642*" versions are not yet in stock. Hopefully soon
When these parts arrive, we will be within spec again
Finally had a chance to get back onto this after a couple of weeks of being sidetracked with AVR micros, PS/2 keyboards and Z80 CRTCs of all things.
I tested out the read-modify-write feature I had recently added and it appears to work. This now lets us write arbitrary bitfields within individual bytes/words/longs and retrieve the prior value in a single mailbox operation.
It will be useful for semaphores, general bitfield updates and graphics pixel operations on any elements that differ from the native 8/16/32 bit storage sizes.
The existing single element access APIs were just these:
PUB readByte(addr) : r
PUB readWord(addr) : r
PUB readLong(addr) : r
PUB writeByte(addr, data) : r
PUB writeWord(addr, data) : r
PUB writeLong(addr, data) : r
And the updated API now includes these too:
PUB readModifyByte(addr, data, mask) : r
PUB readModifyWord(addr, data, mask) : r
PUB readModifyLong(addr, data, mask) : r
The mask is the same size as the element being written (8, 16, or 32 bits), and it's binary ones indicate the bit(s) in the data parameter at these mask bit position(s) that be written to the HyperRAM and overwrite the existing data bit value. When the mask bit zero the corresponding data bit is left alone.
The original value before any update gets applied is also returned by the API.
If the mask used is zero, no updates are applied (and this then defaults to the same behaviour as the general read case in the PASM driver).
I'm now down to 1 long free in COG RAM. That's it.
Most Hyper memory access code and per bank/cog state live there. There's zero longs left in LUTRAM! Any more code space will need to find optimisations or elimination of features.
Things are looking good with this final? version in my testing. Tonight I just got a video frame buffer output to my video driver from HyperFlash for the first time ever. So I can send my driver a frame buffer held in either RAM or Flash controlled just by the nominated address which is mapped to one of the devices on the bus. I should be able to graphics copy image data out of flash directly to RAM too, this could be useful for image resources for GUI elements etc. Just about to try that out once I write something useful for testing into flash.
I am very glad I decided to enable different input timing for each different bank back when I was contemplating doing all that. At one time I wasn't fully sure I would to need to do this and it also added some small setup overheads, but it would have been very hard to add it in at this stage. This issue just showed up already as a problem at 200MHz with my video frame buffer. HyperFlash needed a delay of 8 to show stable pixels while RAM wanted 9. This input delay is kept as a per bank parameter so the driver can handle the different input timing per each read access.
I just noticed something interesting with 16MB HyperRAM on the P2-EVAL during some re-testing. If you cross over from one stacked die to the other in the same HyperRAM multi-chip package (MCP) at the 8MB boundary during a read, the burst read will wrap around only with the starting die and not cross to the next die in the package. When I looked further at the data sheet, I found this was documented as:
5. When Linear Burst is selected by CA[45], the device cannot advance across to next die.
Here's a picture of what happens. I had a frame buffer starting at around 8MB-64kB (with unprovisioned memory but primarily purple colours) and when it crosses at the 8MB boundary, you start to see some colour patterns that I had written at address 0 in the first die in the package. After the scan line burst read completes at the boundary, the frame continues on within the second die (primarily green colours). Same thing happens when wrapping from 16MB back to zero, but the data read at the crossing will be that stored starting at 8MB.
I can't really do much about this now, it is a feature of the HyperRAM MCP itself. The only way we could deal with it would be to specifically test for a 8MB crossing within every burst read and split the bursts at the boundary (a little bit like I did for flash page crossings), but this is not worth the additional overhead on all read bursts and probably also all other fills/copies etc. So to avoid this it is best to just keep your frame buffers (or other data elements) to be fully contained within the same 8MB block if you use an MCP based HyperRAM device like the one on the Parallax HyperRAM module. Future single die HyperRAM devices will probably not have this issue anyway.
Recently I was just testing the round robin scheduling in this HyperRAM driver and noticed something "interesting". As a test I had 5 COGs competing for HyperRAM access and wanted to see how many requests each one obtains relative to the others to compare the fairness.
Each round-robin (RR) COG does the exact same operation drawing some vertical lines into the frame buffer moving from left to right (covering 1920 pixels wide on a FullHD screen), cycling the colour when it reaches the end and wraps around again to the left side. These operations form a colour row or bar per COG on the screen and one strict priority video COG outputs this screen over VGA. If a round-robin COG is getting more requests serviced compared to others its bar advances faster relative to the others and it looks like a race visually with some less serviced COGs being "lapped" making this nice and easy to see in real time. The request servicing over these 5 RR COGs looks good like this and they advance at pretty much the same speed. I count the requests and take the average as a percentage of the total request count and show this at the top left of the screen per COG.
So when all RR COGs do the same thing and request the same operation taking the same duration, it is fair and there is minimal separation between the different COG's bars. However if one COG is then stopped and becomes inactive, its requests cease and the fairness changes. Instead of being equally allocated to the other 4 RR COGs, one COG is given an advantage and the request share then looks more like this (apologies for the blurry shot):
I found the issue is the way these RR COGs are polled. Each time a request is serviced the RR COG polling order advances, like this:
Initially: COG0, COG1, COG2, COG3, COG4
next request: COG1, 2, 3, 4, 0
next request: 2, 3, 4, 0, 1
next request: 3, 4, 0, 1, 2
next request: 4, 0, 1, 2, 3
next request: 0, 1, 2, 3, 4 (continuing etc)
This works fairly when all COGs are active. They have an equal time at the first spot, 2nd spot, 3rd, 4th, and last spot. The problem happens if COG4 is inactive, because the next in line is COG0. This essentially gives COG0 two goes at the top spot because COG4 never needs servicing. For 2 in every 5 requests serviced, COG0 gets polled before the others. To fix this requires another more complicated implementation where you select each RR COG only once per polling loop iteration, or somehow determine the full polling order more randomly so inactive COGs aren't followed by the same COG. I think a polling change like that might have to come later if at all. There is a tradeoff between complexity and polling latency here. My current implementation keeps the polling as simple as possible to try to be as fast as it can be. Currently the loop generated for 5 RR COGs and one strict priority video COG would be the one shown below. It builds a skip mask to determine the polling sequence.
poller
incmod rrcounter, #4 'cycle the round-robin (RR) counter
bmask mask, rrcounter 'generate a RR skip mask from the count
shl mask, #1 'don't skip first instruction in skip mask
repcount rep #10, #0 'repeat until we get a request for something
setq #24-1 'read 24 longs
rdlong req0, mbox 'get all mailbox requests and data longs
tjs req7, cog7_handler ' (video COG for example)
polling_code skipf mask ']dyanmic polling code starts from here....
jatn atn_handler ']JATN triggers reconfiguration
tjs req0, cog0_handler ']
tjs req1, cog1_handler ']
tjs req2, cog2_handler ']
tjs req3, cog3_handler ']
tjs req4, cog4_handler '] Loop is generated based on
tjs req0, cog0_handler '] number of RR COGs
tjs req1, cog1_handler ']
tjs req2, cog2_handler ']
tjs req3, cog3_handler ']
tjs req4, cog4_handler ']
I think to work around this polling issue when RR COGs are all doing the same thing and true fairness is needed, it is best to only enable COGs in the RR polling loop that will actually be active and remove any others already in there by default.
When requests are randomly arriving this should be far less of a problem. It's mainly happening when they are all doing the same thing at the same rate, and some COG(s) are idle. I noticed if I add a random delay to each RR client COG after it's request completes the fairness starts to return.
I wish I had something useful to contribute, but man, what a great visual aide/tool...really neat idea. Coming up with tools like this really help diagnose problems, letting you really see them. Sometimes it takes something like this to get to that "Ah! I know what's wrong now" moment.
Yeah this approach helped me encounter the issue visually and it was good to use it to prove out the strict priority COG polling setting as well. In that case the bar of the highest priority COG (after video) screams along the fastest, then the next priority COG's bar, the third priority bar is pretty slow to move, and the fourth bar barely moves at all. Definitely strict priority as intended there.
For some time I thought I must have had a bug in the code causing this type of unfairness, but in the end it was just the polling design's own limitation. I needed to write it down and really think about the effect of the polling sequence and what happens when skipping inactive COGs. It would be cool to come up with a fast scheme that somehow improves on this and keeps it fair for equal load even when some COGs are idle and which still fits in the existing COGRAM footprint. If it doesn't fit the space then any change to that poller will have to wait until I free more COGRAM by changing the table lookup method I use and would then add four extra cycles of overhead per request. Doing that can wait for another time though....it's not worth all the extra work right now. I want to get it out, it's working very well already.
I was hoping I might be able to reorder the HyperRAM driver's mailbox parameter order to increase performance slightly for PASM clients. Right now the mailbox parameter order is this:
mailboxBase + 0: request & bank/external address
mailboxBase + 4: read/write data or read/write hub address for bursts
mailboxBase + 8: mask or transfer count
If I reverse this order it then has the request long written last which triggers the memory request and a SETQ #2 followed by a WRLONG is a safe and fast way to generate memory requests even with the fifo running as it might be in video driver clients. The existing problem is that any fifo use can interfere with the SETQ read/write bursts and introduce gaps between longs transferred and potentially cause a request to be triggered prematurely with stale data parameters in some mailbox registers. My own video driver client works around this issue by writing the second two mailbox longs first (with a SETQ #1 burst), then the writing to request mailbox long separately after that, but doing this slows down the initial request a little bit. Changing this order would improve that side of things. I also thought that it may let the polling sequence that reads the result and the status be tightened to something like this sample below which would also allow us to use the flags from the last long of the read burst to detect the poll exit condition, however there is still a problem...
mov addr, ##$ABCDE
call #readlong ' read HyperRAM memory
' data reg now contains the result
...
' readlong:
' input addr - external address to read from (trashed afterwards)
' output data - result
readlong
mov mask, #0 ' set mask = 0 to prevent read-modify-write cycle
setbyte addr, #READLONG, #3 ' setup 32 bit read request using external address
setq #3-1 ' writing 3 longs
wrlong mask, mailboxPtr ' trigger mailbox request
rep #3, #0 ' poll for result (15 clock cycle rep loop once aligned to hub)
setq #3-1 ' reading 3 longs (or possibly just two if you had a second mailboxPtr)
rdlong mask, mailboxPtr wcz ' read memory data and polling status
if_nc ret wcz ' need to check with evanh if you can really return from the end of a rep loop
mask long 0
data long 0
addr long 0
mailboxPtr long MAILBOXADDR
The new problem is that if the final read burst of the data result+status itself is interrupted by a fifo transfer on the client COG between reading the data and the polling status, you might have stale data read into the data result long, you'd need to read it again after the REP loop completes if you ever use the fifo during the polling operation. So the change of order helps one side but hinders the other side. We sort of want to keep the existing order on polling for the result to prevent this problem. We can't really fix both ends.
It would be a fairly simple change in the PASM driver to reorder the mailbox but the SPIN layer which abstracts this ordering needs to be changed in lots of places (still mostly straightforward). If it is going to happen at all I think it's worth doing now before the code is released because changing it later will affect all the PASM clients.
I'll need to mull this over and I'm really on the fence about doing it now it introduces new problems...any suggestions?
After thinking through a bit more I should just keep the original mailbox order as is. The result polling side should be the one that is optimized given rdlong's typically take longer than wrlongs to execute. Also the fifo use is a special case, and not all PASM clients will need that so they can still use a 3 wrlong burst to setup the mailbox whenever they don't use the fifo, even using the way it works today. With any luck doing this request setup sequence won't add too many extra total clocks anyway for the second WRLONG.
And the read result polling can still exit the mailbox polling loop with the data quickly this way without needing the flags read:
POP exitaddr ' pop return address off the stack
REP #3, #0 ' repeat until result is ready
SETQ #2-1 ' read data and status
RDLONG addr, mailboxptr1
TJNS addr, exitaddr ' returns to caller when request is serviced
@evanh In the earlier post the existing mailbox long order was shown but I was just contemplating changing it to enable the request setup/polling according to the sample code provided. However in the end I've decided against changing it as it still introduces a problem on the read side even though it can improve writing the request.
Had a question you might be able to answer with your knowledge of the egg beater timing. In this code sequence:
How long would the second WRLONG take if mailboxPtr1 = mailboxPtr2 - 4 ?
I know that WRLONGs take anywhere from 3-10 clocks, but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.
... but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.
It is a determinable amount of sysclocks but it depends the modulo'd difference between the final address of the burst and address of mailboxPrt1. If they both shift in unison, ie: the delta doesn't change, then you're in luck.
EDIT: Basically, if you can arrange addresses of burstEnd % totalCogs == (mailboxPrt1 - 3) % totalCogs then you should achieve optimal WRLONG of 3 sysclock.
EDIT2: I don't think I'd worry about it with a 4-cog prop2, but a 16-cog prop2 would definitely want to use this.
Comments
No point in asking me - I have no idea how XMM is implemented in GCC. I know Steve Densen did some of the original work on the cache which (I believe) both GCC and Catalina use to support XMM where the RAM itself is too slow to access directly (e.g. SRAM) - so you are probably right that they are basically similar.
I am still trying to come to grips with whether it is worth implementing code execution from XMM on the P2. I keep wavering between doing a trivial "P1" type port - which would be very easy but suffer from the same problems as XMM code execution has on the P1 (i.e. that it is quite slow) - or doing something more sophisticated.
The problem is that I really have no idea yet whether XMM will be widely used on the P2. Even on the P1 (which absolutely needed it to execute large programs) it didn't see very much use outside us hardcore fanatics.
I will probably continue to waver on this for a bit yet. There are just too many other interesting things to do!
Either HyperRAM or HyperFlash could be used for the program memory with the same amount of performance. Programs could grow as large as 32MB with the Flash on the P2-EVAL breakout. That's massive. I think this is worth some consideration once people get familiar with how the HyperRAM/Flash could be used here.
With video applications you could still share a video frame buffer in external memory with program memory and give the video driver priority over the caching VM loader. Performance can take a hit but it still could work. There's a lot of memory bandwidth to go around.
We used overlays back in the 70’s on the minis and I suspect even earlier on the mainframes. It was absolutely necessary in a 5KB core memory model (cog) with a shared (ie hub) of 10KB.
Current situation:
COGRAM use 502 longs
LUTRAM use 512 longs
I might be able to free a few more COGRAM longs by sharing registers even more but that also gets risky and makes the code more fragile in any future changes, especially if you give the same register two different names for more clarity. You think you can freely re-use something but it then creates a side-effect somewhere else, and the more branching and skipf code paths etc, the harder this stuff gets to track.
Here's pins P32-P47:
Registering the HR clkpin for data reads definitely gives a higher usable clock speed. And setting the CR0 drive strength to 19 ohms does help a little too. I'm getting over 320 MT/s read speed now. On the down side, the 22 pF capacitor definitely drags it down. A dedicated board layout will help a lot.
EDIT: ie: The difference between a basepin of 16 and 32 is tiny and generally doesn't impact reliability.
Did you try a trimmer for that skew-cap ? Now there are nice error tables, there may be a better C value ?
I was quite happy with the measured result on the scope. The 22 pF brought the slope nicely parallel to the data traces and the roughly 1 ns lag was just what I wanted.
There was clear case of attenuation kicking in though. The board layout has a lot of capacitance I suspect. I don't think any signal improvement can be made without a dedicated snug hyperRAM on the prop2 board. Even then, I worry that adding a capacitor will be a major nerf, so want to have the space to modify the first experiment board.
Sadly I don't have any layout even on the drawing board yet.
On the other hand, I still favour keeping P28-P31 and associated VIO away from any connectors. If that VIO is taken out then the prop2 is bricked because the sysclock oscillators won't power up without it.
Is the hyperram rated for the clock speeds you are using?
The data rates you are achieving top mine in a complete
Different project by a good number
Best regards
The HyperFlash is IS26KL256S-DABLI00, also 200 MT/s.
So far, the 1v8 parts of version 2 are available in stock, "S27KS0642*", but the 3v3 "S27KL0642*" versions are not yet in stock. Hopefully soon
When these parts arrive, we will be within spec again
I tested out the read-modify-write feature I had recently added and it appears to work. This now lets us write arbitrary bitfields within individual bytes/words/longs and retrieve the prior value in a single mailbox operation.
It will be useful for semaphores, general bitfield updates and graphics pixel operations on any elements that differ from the native 8/16/32 bit storage sizes.
The existing single element access APIs were just these:
And the updated API now includes these too:
The mask is the same size as the element being written (8, 16, or 32 bits), and it's binary ones indicate the bit(s) in the data parameter at these mask bit position(s) that be written to the HyperRAM and overwrite the existing data bit value. When the mask bit zero the corresponding data bit is left alone.
The original value before any update gets applied is also returned by the API.
If the mask used is zero, no updates are applied (and this then defaults to the same behaviour as the general read case in the PASM driver).
I'm now down to 1 long free in COG RAM. That's it.
Things are looking good with this final? version in my testing. Tonight I just got a video frame buffer output to my video driver from HyperFlash for the first time ever. So I can send my driver a frame buffer held in either RAM or Flash controlled just by the nominated address which is mapped to one of the devices on the bus. I should be able to graphics copy image data out of flash directly to RAM too, this could be useful for image resources for GUI elements etc. Just about to try that out once I write something useful for testing into flash.
I am very glad I decided to enable different input timing for each different bank back when I was contemplating doing all that. At one time I wasn't fully sure I would to need to do this and it also added some small setup overheads, but it would have been very hard to add it in at this stage. This issue just showed up already as a problem at 200MHz with my video frame buffer. HyperFlash needed a delay of 8 to show stable pixels while RAM wanted 9. This input delay is kept as a per bank parameter so the driver can handle the different input timing per each read access.
5. When Linear Burst is selected by CA[45], the device cannot advance across to next die.
Here's a picture of what happens. I had a frame buffer starting at around 8MB-64kB (with unprovisioned memory but primarily purple colours) and when it crosses at the 8MB boundary, you start to see some colour patterns that I had written at address 0 in the first die in the package. After the scan line burst read completes at the boundary, the frame continues on within the second die (primarily green colours). Same thing happens when wrapping from 16MB back to zero, but the data read at the crossing will be that stored starting at 8MB.
I can't really do much about this now, it is a feature of the HyperRAM MCP itself. The only way we could deal with it would be to specifically test for a 8MB crossing within every burst read and split the bursts at the boundary (a little bit like I did for flash page crossings), but this is not worth the additional overhead on all read bursts and probably also all other fills/copies etc. So to avoid this it is best to just keep your frame buffers (or other data elements) to be fully contained within the same 8MB block if you use an MCP based HyperRAM device like the one on the Parallax HyperRAM module. Future single die HyperRAM devices will probably not have this issue anyway.
Each round-robin (RR) COG does the exact same operation drawing some vertical lines into the frame buffer moving from left to right (covering 1920 pixels wide on a FullHD screen), cycling the colour when it reaches the end and wraps around again to the left side. These operations form a colour row or bar per COG on the screen and one strict priority video COG outputs this screen over VGA. If a round-robin COG is getting more requests serviced compared to others its bar advances faster relative to the others and it looks like a race visually with some less serviced COGs being "lapped" making this nice and easy to see in real time. The request servicing over these 5 RR COGs looks good like this and they advance at pretty much the same speed. I count the requests and take the average as a percentage of the total request count and show this at the top left of the screen per COG.
So when all RR COGs do the same thing and request the same operation taking the same duration, it is fair and there is minimal separation between the different COG's bars. However if one COG is then stopped and becomes inactive, its requests cease and the fairness changes. Instead of being equally allocated to the other 4 RR COGs, one COG is given an advantage and the request share then looks more like this (apologies for the blurry shot):
COG0 40%
COG1 20%
COG2 20%
COG3 20%
COG4 0% (inactive)
I found the issue is the way these RR COGs are polled. Each time a request is serviced the RR COG polling order advances, like this:
Initially: COG0, COG1, COG2, COG3, COG4
next request: COG1, 2, 3, 4, 0
next request: 2, 3, 4, 0, 1
next request: 3, 4, 0, 1, 2
next request: 4, 0, 1, 2, 3
next request: 0, 1, 2, 3, 4 (continuing etc)
This works fairly when all COGs are active. They have an equal time at the first spot, 2nd spot, 3rd, 4th, and last spot. The problem happens if COG4 is inactive, because the next in line is COG0. This essentially gives COG0 two goes at the top spot because COG4 never needs servicing. For 2 in every 5 requests serviced, COG0 gets polled before the others. To fix this requires another more complicated implementation where you select each RR COG only once per polling loop iteration, or somehow determine the full polling order more randomly so inactive COGs aren't followed by the same COG. I think a polling change like that might have to come later if at all. There is a tradeoff between complexity and polling latency here. My current implementation keeps the polling as simple as possible to try to be as fast as it can be. Currently the loop generated for 5 RR COGs and one strict priority video COG would be the one shown below. It builds a skip mask to determine the polling sequence.
I think to work around this polling issue when RR COGs are all doing the same thing and true fairness is needed, it is best to only enable COGs in the RR polling loop that will actually be active and remove any others already in there by default.
When requests are randomly arriving this should be far less of a problem. It's mainly happening when they are all doing the same thing at the same rate, and some COG(s) are idle. I noticed if I add a random delay to each RR client COG after it's request completes the fairness starts to return.
For some time I thought I must have had a bug in the code causing this type of unfairness, but in the end it was just the polling design's own limitation. I needed to write it down and really think about the effect of the polling sequence and what happens when skipping inactive COGs. It would be cool to come up with a fast scheme that somehow improves on this and keeps it fair for equal load even when some COGs are idle and which still fits in the existing COGRAM footprint. If it doesn't fit the space then any change to that poller will have to wait until I free more COGRAM by changing the table lookup method I use and would then add four extra cycles of overhead per request. Doing that can wait for another time though....it's not worth all the extra work right now. I want to get it out, it's working very well already.
mailboxBase + 0: request & bank/external address
mailboxBase + 4: read/write data or read/write hub address for bursts
mailboxBase + 8: mask or transfer count
If I reverse this order it then has the request long written last which triggers the memory request and a SETQ #2 followed by a WRLONG is a safe and fast way to generate memory requests even with the fifo running as it might be in video driver clients. The existing problem is that any fifo use can interfere with the SETQ read/write bursts and introduce gaps between longs transferred and potentially cause a request to be triggered prematurely with stale data parameters in some mailbox registers. My own video driver client works around this issue by writing the second two mailbox longs first (with a SETQ #1 burst), then the writing to request mailbox long separately after that, but doing this slows down the initial request a little bit. Changing this order would improve that side of things. I also thought that it may let the polling sequence that reads the result and the status be tightened to something like this sample below which would also allow us to use the flags from the last long of the read burst to detect the poll exit condition, however there is still a problem...
The new problem is that if the final read burst of the data result+status itself is interrupted by a fifo transfer on the client COG between reading the data and the polling status, you might have stale data read into the data result long, you'd need to read it again after the REP loop completes if you ever use the fifo during the polling operation. So the change of order helps one side but hinders the other side. We sort of want to keep the existing order on polling for the result to prevent this problem. We can't really fix both ends.
It would be a fairly simple change in the PASM driver to reorder the mailbox but the SPIN layer which abstracts this ordering needs to be changed in lots of places (still mostly straightforward). If it is going to happen at all I think it's worth doing now before the code is released because changing it later will affect all the PASM clients.
I'll need to mull this over and I'm really on the fence about doing it now it introduces new problems...any suggestions?
And the read result polling can still exit the mailbox polling loop with the data quickly this way without needing the flags read:
I'm going to leave the order alone.
Had a question you might be able to answer with your knowledge of the egg beater timing. In this code sequence: How long would the second WRLONG take if mailboxPtr1 = mailboxPtr2 - 4 ?
I know that WRLONGs take anywhere from 3-10 clocks, but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.
EDIT: Basically, if you can arrange addresses of burstEnd % totalCogs == (mailboxPrt1 - 3) % totalCogs then you should achieve optimal WRLONG of 3 sysclock.
EDIT2: I don't think I'd worry about it with a 4-cog prop2, but a 16-cog prop2 would definitely want to use this.