The usual P1 XMM model (as implemented in GCC and probably similar in Catalina, but I don't know, ask @RossH ) ...
No point in asking me - I have no idea how XMM is implemented in GCC. I know Steve Densen did some of the original work on the cache which (I believe) both GCC and Catalina use to support XMM where the RAM itself is too slow to access directly (e.g. SRAM) - so you are probably right that they are basically similar.
I am still trying to come to grips with whether it is worth implementing code execution from XMM on the P2. I keep wavering between doing a trivial "P1" type port - which would be very easy but suffer from the same problems as XMM code execution has on the P1 (i.e. that it is quite slow) - or doing something more sophisticated.
The problem is that I really have no idea yet whether XMM will be widely used on the P2. Even on the P1 (which absolutely needed it to execute large programs) it didn't see very much use outside us hardcore fanatics.
I will probably continue to waver on this for a bit yet. There are just too many other interesting things to do!
If some form of XMM was doable with HUB-exec and caching it could actually be quite a good combination. If the generated code knows about all its jumps/calls that cross some "page" boundary (I know it is not true paging), it could possibly return to some page loader code that either jumps over to another cached page elsewhere in HUB or brings the next page in from HyperRAM, and a whole bunch of these pages could be retained in the 512kB of HUB. With the largest burst transfers it might only take a few microseconds to bring in around 512 bytes to 1kB or so (128-256 instructions). For code that doesn't branch everywhere and mostly stays within its working set, performance could still be pretty decent. Smaller page transfers could also be used to speed up the loading rate at the expense of more inter-page branching. It could be tuned to see what page sizes work better.
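As a very rough Spin2 sketch of the kind of lookup such a page loader might do (sizes and names here are placeholders, direct-mapped pages, cache initialisation omitted):

CON
  PAGE_SIZE  = 512                      ' bytes per code page (tunable)
  PAGE_COUNT = 64                       ' pages kept cached in HUB (32kB of HUB in this sketch)

VAR
  long pageTag[PAGE_COUNT]              ' which external page currently occupies each HUB slot
  long pageBase[PAGE_COUNT]             ' HUB address of each slot (assumed filled in at init)

PUB pageAddr(extAddr) : hubAddr | page, slot
  ' return a HUB address for external code address extAddr, loading the page if it isn't cached
  page := extAddr / PAGE_SIZE
  slot := page // PAGE_COUNT            ' direct-mapped: one candidate slot per page
  if pageTag[slot] <> page
    ' burst-read PAGE_SIZE bytes from HyperRAM into pageBase[slot] here
    ' (e.g. a hypothetical hyper.readBurst(page * PAGE_SIZE, pageBase[slot], PAGE_SIZE))
    pageTag[slot] := page
  hubAddr := pageBase[slot] + (extAddr // PAGE_SIZE)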
Either HyperRAM or HyperFlash could be used for the program memory with the same performance. Programs could grow as large as 32MB with the Flash on the P2-EVAL breakout. That's massive. I think this is worth some consideration once people get familiar with how the HyperRAM/Flash could be used here.
With video applications you could still share a video frame buffer in external memory with program memory and give the video driver priority over the caching VM loader. Performance can take a hit but it still could work. There's a lot of memory bandwidth to go around.
I wrote a P1 Fast Overlay loader back in 2008? Heater used it in ZiCog. It loads from hub in reverse, i.e. last address first, so that it hits every hub slot.
We used overlays back in the 70s on the minis, and I suspect even earlier on the mainframes. It was absolutely necessary in a 5KB core memory model (think cog) with a shared memory (i.e. hub) of 10KB.
So I was able to find enough COGRAM and LUTRAM to squeeze in the read-modify-write stuff I was talking about recently. I think it should work out now. Unfortunately I had to shuffle around my EXECF sequences for reads quite a bit to make it all fit which is always a risk of breaking other stuff in pretty nasty ways, so I'll need to re-test a lot of this again.
Current situation:
COGRAM usage: 502 longs
LUTRAM usage: 512 longs
I might be able to free a few more COGRAM longs by sharing registers even more, but that also gets risky and makes the code more fragile for future changes, especially if you give the same register two different names for clarity. You think you can freely re-use something, but it then creates a side-effect somewhere else, and the more branching and skipf code paths there are, the harder this stuff gets to track.
I've implemented setting of CR0 myself now. In the process of testing out more combinations I just bumped into confirmation of Von's assertion that P16-P31 pins are more evenly routed than the others. Read data testing doesn't seem to be affected but sysclock/1 writes definitely are. This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.
What sort of write timing variation have you found as you change CR0 impedance @evanh? The results posted above are for the lowest impedance 19 ohm drive value for cr0 ($ff1f) by the looks of it. I wonder how much that should actually affect write timing if this bus is tri-stated? Do we have other values to compare this with?
Roger,
Registering the HR clkpin for data reads definitely gives a higher usable clock speed. And setting the CR0 drive strength to 19 ohms does help a little too. I'm getting over 320 MT/s read speed now. On the down side, the 22 pF capacitor definitely drags it down. A dedicated board layout will help a lot.
What sort of write timing variation have you found as you change CR0 impedance @evanh? The results posted above are for the lowest impedance 19 ohm drive value for cr0 ($ff1f) by the looks of it. I wonder how much that should actually affect write timing if this bus is tri-stated? Do we have other values to compare this with?
That's with the capacitor in place. It's only a demo of the differences with something sensitive.
EDIT: ie: The difference between a basepin of 16 and 32 is tiny and generally doesn't impact reliability.
This information on keeping routing well matched (and in general short) should be useful for @"Peter Jakacki" and his HyperRAM add-on for P2D2. His design uses P32-P39 for the data bus, but there is nothing bad about those P2 pins in general; it's just that on the P2-EVAL board their trace lengths are not quite as evenly matched to the header pins as other ports, which may compromise a future sysclk/1 write operation. I still use P32 all the time with sysclk/1 reads.
.... This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.
I think you meant 22pF there ?
Oops, lol, yeah, 22 pF.
Did you try a trimmer for that skew-cap? Now that there are nice error tables, there may be a better C value?
I was quite happy with the measured result on the scope. The 22 pF brought the slope nicely parallel to the data traces and the roughly 1 ns lag was just what I wanted.
There was a clear case of attenuation kicking in though. The board layout has a lot of capacitance, I suspect. I don't think any signal improvement can be made without a HyperRAM sitting snugly on a dedicated prop2 board. Even then, I worry that adding a capacitor will be a major nerf, so I want to have the space to modify the first experiment board.
Sadly I don't have any layout even on the drawing board yet.
This information on keeping routing well matched (and in general short) should be useful for @"Peter Jakacki" and his HyperRAM add-on for P2D2. His design uses P32-P39 for the data bus, but there is nothing bad about those P2 pins in general; it's just that on the P2-EVAL board their trace lengths are not quite as evenly matched to the header pins as other ports, which may compromise a future sysclk/1 write operation. I still use P32 all the time with sysclk/1 reads.
Correct, it's the board causing it, not the chip.
On the other hand, I still favour keeping P28-P31 and associated VIO away from any connectors. If that VIO is taken out then the prop2 is bricked because the sysclock oscillators won't power up without it.
Just a quick question: is the HyperRAM rated for the clock speeds you are using? The data rates you are achieving top mine in a completely different project by a good number.
Best regards
Nope. We're using Parallax's accessory board, which uses a 200 MT/s rated part (IS66WVH16M8BLL). Given how much better the HyperFlash has performed for Roger, maybe the faster HyperBus 2 parts will be a notable boost for us.
The HyperFlash is IS26KL256S-DABLI00, also 200 MT/s.
Surac, there's a "version 2" of HyperRAM coming from Infineon/Cypress that goes faster and is rated to 400 MB/s (= 400 MT/s), in both 1v8 and 3v3 variants.
So far, the 1v8 parts of version 2 are available in stock, "S27KS0642*", but the 3v3 "S27KL0642*" versions are not yet in stock. Hopefully soon.
When these parts arrive, we will be within spec again.
Finally had a chance to get back onto this after a couple of weeks of being sidetracked with AVR micros, PS/2 keyboards and Z80 CRTCs of all things.
I tested out the read-modify-write feature I had recently added and it appears to work. This now lets us write arbitrary bitfields within individual bytes/words/longs and retrieve the prior value in a single mailbox operation.
It will be useful for semaphores, general bitfield updates and graphics pixel operations on any elements that differ from the native 8/16/32 bit storage sizes.
The existing single element access APIs were just these:
PUB readByte(addr) : r
PUB readWord(addr) : r
PUB readLong(addr) : r
PUB writeByte(addr, data) : r
PUB writeWord(addr, data) : r
PUB writeLong(addr, data) : r

And the updated API now includes these too:

PUB readModifyByte(addr, data, mask) : r
PUB readModifyWord(addr, data, mask) : r
PUB readModifyLong(addr, data, mask) : r
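For example, assuming the driver is instantiated as a child object called hyper (the object and file names here are placeholders, not the driver's actual names):

OBJ
  hyper : "hyperram"                    ' placeholder object/file name for the driver

PUB basicAccess() | value
  hyper.writeLong($1000, $DEADBEEF)     ' write a long to external address $1000
  value := hyper.readLong($1000)        ' read it back (value == $DEADBEEF)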
The mask is the same size as the element being written (8, 16, or 32 bits), and its one bits indicate which bit(s) of the data parameter will be written to the HyperRAM, overwriting the existing stored bits. Where a mask bit is zero, the corresponding bit in memory is left alone.
The original value before any update gets applied is also returned by the API.
If the mask used is zero, no updates are applied (and this then defaults to the same behaviour as the general read case in the PASM driver).
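So, in effect, a read-modify-write behaves like the merge below (just a sketch of the semantics, not the driver's internal code):

PUB mergedValue(oldValue, data, mask) : r
  ' after readModifyLong(addr, data, mask), where oldValue is what the call returned,
  ' the long now stored at addr is:
  r := (oldValue & !mask) | (data & mask)
  ' e.g. readModifyLong(addr, $5A, $FF) overwrites only bits 7..0 with $5A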
I'm now down to 1 long free in COG RAM. That's it.
Most Hyper memory access code and per-bank/COG state lives there. There are zero longs left in LUTRAM! Any more code will need optimisations or the elimination of features.
Things are looking good with this final? version in my testing. Tonight I just got a video frame buffer output to my video driver from HyperFlash for the first time ever. So I can send my driver a frame buffer held in either RAM or Flash, selected just by the nominated address, which is mapped to one of the devices on the bus. I should be able to graphics copy image data out of flash directly to RAM too; this could be useful for image resources for GUI elements etc. Just about to try that out once I write something useful for testing into flash.
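Roughly what that looks like from the client side, with invented bank base addresses just to show that the address alone selects the device:

CON
  RAM_BASE   = $0000_0000               ' HyperRAM mapped at this external address (example value)
  FLASH_BASE = $0200_0000               ' HyperFlash mapped here (example value)

PUB pickFrameBuffer(useFlash) : fb
  ' the video driver just receives an address; whichever bank it falls in selects the device
  if useFlash
    fb := FLASH_BASE + $10_0000
  else
    fb := RAM_BASE + $10_0000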
I am very glad I decided to enable different input timing for each bank back when I was contemplating doing all that. At one time I wasn't fully sure I would need to do this, and it also added some small setup overheads, but it would have been very hard to add it in at this stage. It already showed up as a problem at 200MHz with my video frame buffer: HyperFlash needed a delay of 8 to show stable pixels while RAM wanted 9. This input delay is kept as a per-bank parameter so the driver can handle the different input timing for each read access.
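A sketch of the per-bank idea in Spin2 (the bank names are invented; the point is only that the input delay is stored per bank rather than globally):

CON
  #0, BANK_RAM, BANK_FLASH              ' bank indices used in this sketch

VAR
  long inputDelay[2]                    ' one input-delay value kept per bank

PUB configureBanks()
  inputDelay[BANK_FLASH] := 8           ' HyperFlash pixels stable with a delay of 8 at 200MHz here
  inputDelay[BANK_RAM]   := 9           ' the HyperRAM wanted 9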
I just noticed something interesting with 16MB HyperRAM on the P2-EVAL during some re-testing. If you cross over from one stacked die to the other in the same HyperRAM multi-chip package (MCP) at the 8MB boundary during a read, the burst read will wrap around only with the starting die and not cross to the next die in the package. When I looked further at the data sheet, I found this was documented as:
5. When Linear Burst is selected by CA[45], the device cannot advance across to next die.
Here's a picture of what happens. I had a frame buffer starting at around 8MB-64kB (with unprovisioned memory but primarily purple colours) and when it crosses at the 8MB boundary, you start to see some colour patterns that I had written at address 0 in the first die in the package. After the scan line burst read completes at the boundary, the frame continues on within the second die (primarily green colours). Same thing happens when wrapping from 16MB back to zero, but the data read at the crossing will be that stored starting at 8MB.
I can't really do much about this now; it is a feature of the HyperRAM MCP itself. The only way we could deal with it would be to specifically test for an 8MB crossing within every burst read and split the bursts at the boundary (a little bit like I did for flash page crossings), but this is not worth the additional overhead on all read bursts, and probably also all other fills/copies etc. So to avoid this, it is best to keep your frame buffers (or other data elements) fully contained within the same 8MB block if you use an MCP based HyperRAM device like the one on the Parallax HyperRAM module. Future single die HyperRAM devices will probably not have this issue anyway.
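If an application really did need a burst that spans the 8MB die boundary, the split could be done on the client side instead of in the driver. A rough Spin2 sketch, with the burst-read call name invented:

CON
  DIE_SIZE = $80_0000                   ' 8MB per die in the 16MB MCP

PUB readAcrossDies(extAddr, hubAddr, count) | first
  ' split a burst read so no single burst crosses an 8MB die boundary
  first := DIE_SIZE - (extAddr // DIE_SIZE)
  if first < count
    readBurst(extAddr, hubAddr, first)
    readBurst(extAddr + first, hubAddr + first, count - first)
  else
    readBurst(extAddr, hubAddr, count)

PRI readBurst(extAddr, hubAddr, count)
  ' stand-in for the driver's actual burst-read call (name invented for this sketch)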
Recently I was just testing the round robin scheduling in this HyperRAM driver and noticed something "interesting". As a test I had 5 COGs competing for HyperRAM access and wanted to see how many requests each one obtains relative to the others to compare the fairness.
Each round-robin (RR) COG does the exact same operation, drawing vertical lines into the frame buffer moving from left to right (covering 1920 pixels wide on a FullHD screen), cycling the colour when it reaches the end and wrapping around again to the left side. These operations form a colour row or bar per COG on the screen, and one strict priority video COG outputs this screen over VGA. If a round-robin COG is getting more requests serviced compared to the others, its bar advances faster and it looks like a race visually, with less serviced COGs being "lapped", making this nice and easy to see in real time. The request servicing over these 5 RR COGs looks good like this and they advance at pretty much the same speed. I count the requests, take the average as a percentage of the total request count, and show this at the top left of the screen per COG.
So when all RR COGs do the same thing and request the same operation taking the same duration, it is fair and there is minimal separation between the different COGs' bars. However if one COG is then stopped and becomes inactive, its requests cease and the fairness changes. Instead of being equally allocated to the other 4 RR COGs, one COG is given an advantage and the request share then looks more like this (apologies for the blurry shot):
COG0 40%
COG1 20%
COG2 20%
COG3 20%
COG4 0% (inactive)
I found the issue is the way these RR COGs are polled. Each time a request is serviced the RR COG polling order advances, like this:
Initially: COG0, COG1, COG2, COG3, COG4
next request: COG1, 2, 3, 4, 0
next request: 2, 3, 4, 0, 1
next request: 3, 4, 0, 1, 2
next request: 4, 0, 1, 2, 3
next request: 0, 1, 2, 3, 4 (continuing etc)
This works fairly when all COGs are active. They have an equal time at the first spot, 2nd spot, 3rd, 4th, and last spot. The problem happens if COG4 is inactive, because the next in line is COG0. This essentially gives COG0 two goes at the top spot because COG4 never needs servicing. For 2 in every 5 requests serviced, COG0 gets polled before the others. To fix this requires another more complicated implementation where you select each RR COG only once per polling loop iteration, or somehow determine the full polling order more randomly so inactive COGs aren't followed by the same COG. I think a polling change like that might have to come later if at all. There is a tradeoff between complexity and polling latency here. My current implementation keeps the polling as simple as possible to try to be as fast as it can be. Currently the loop generated for 5 RR COGs and one strict priority video COG would be the one shown below. It builds a skip mask to determine the polling sequence.
poller
                incmod  rrcounter, #4           'cycle the round-robin (RR) counter
                bmask   mask, rrcounter         'generate a RR skip mask from the count
                shl     mask, #1                'don't skip first instruction in skip mask
repcount        rep     #10, #0                 'repeat until we get a request for something
                setq    #24-1                   'read 24 longs
                rdlong  req0, mbox              'get all mailbox requests and data longs
                tjs     req7, cog7_handler      ' (video COG for example)
polling_code    skipf   mask                    ']dynamic polling code starts from here....
                jatn    atn_handler             ']JATN triggers reconfiguration
                tjs     req0, cog0_handler      ']
                tjs     req1, cog1_handler      ']
                tjs     req2, cog2_handler      ']
                tjs     req3, cog3_handler      ']
                tjs     req4, cog4_handler      '] Loop is generated based on
                tjs     req0, cog0_handler      '] number of RR COGs
                tjs     req1, cog1_handler      ']
                tjs     req2, cog2_handler      ']
                tjs     req3, cog3_handler      ']
                tjs     req4, cog4_handler      ']
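The 40/20/20/20 split falls straight out of that rotation. Here is a little Spin2 model of just the selection arithmetic (not driver code), assuming the four active COGs always have a request pending and COG4 stays idle:

CON
  RRCOGS = 5

VAR
  long serviced[RRCOGS]

PUB model() | start, i, c
  start := 0
  repeat 100_000
    start := (start + 1) // RRCOGS              ' polling start position advances once per serviced request
    repeat i from 0 to RRCOGS - 1
      c := (start + i) // RRCOGS
      if c <> 4                                 ' the idle COG never has a request to service
        serviced[c] := serviced[c] + 1          ' first pending COG in the rotated order wins
        quit
  ' serviced[0..4] ends up at 40%, 20%, 20%, 20%, 0% of the total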
I think to work around this polling issue when the RR COGs are all doing the same thing and true fairness is needed, it is best to enable only the COGs in the RR polling loop that will actually be active, and remove any others that are in there by default.
When requests are randomly arriving this should be far less of a problem. It's mainly happening when they are all doing the same thing at the same rate and some COG(s) are idle. I noticed that if I add a random delay to each RR client COG after its request completes, the fairness starts to return.
I wish I had something useful to contribute, but man, what a great visual aid/tool...really neat idea. Coming up with tools like this really helps diagnose problems, letting you really see them. Sometimes it takes something like this to get to that "Ah! I know what's wrong now" moment.
Yeah this approach helped me encounter the issue visually and it was good to use it to prove out the strict priority COG polling setting as well. In that case the bar of the highest priority COG (after video) screams along the fastest, then the next priority COG's bar, the third priority bar is pretty slow to move, and the fourth bar barely moves at all. Definitely strict priority as intended there.
For some time I thought I must have had a bug in the code causing this type of unfairness, but in the end it was just the polling design's own limitation. I needed to write it down and really think about the effect of the polling sequence and what happens when skipping inactive COGs. It would be cool to come up with a fast scheme that somehow improves on this and keeps it fair for equal load even when some COGs are idle, and which still fits in the existing COGRAM footprint. If it doesn't fit the space then any change to that poller will have to wait until I free more COGRAM by changing the table lookup method I use, which would then add four extra cycles of overhead per request. Doing that can wait for another time though...it's not worth all the extra work right now. I want to get this out; it's working very well already.
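For what it's worth, the "select each RR COG only once per polling pass" idea could be modelled like this, again just Spin2 pseudocode of the selection logic rather than anything that would drop straight into the PASM poller:

CON
  RRCOGS = 5

PUB fairPass(startCog, pending) : handled | i, c
  ' one polling pass: starting from startCog, service each pending RR COG at most once,
  ' so an idle COG's turn is simply skipped instead of being handed to its neighbour
  repeat i from 0 to RRCOGS - 1
    c := (startCog + i) // RRCOGS
    if pending & (1 << c)
      ' ...dispatch COG c's request here...
      handled := handled + 1
  ' the next pass would then start at (startCog + 1) // RRCOGS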
I was hoping I might be able to reorder the HyperRAM driver's mailbox parameter order to increase performance slightly for PASM clients. Right now the mailbox parameter order is this:
mailboxBase + 0: request & bank/external address
mailboxBase + 4: read/write data or read/write hub address for bursts
mailboxBase + 8: mask or transfer count
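In Spin terms, a request with the current ordering looks roughly like this (a sketch only: the bit-31 "pending" convention is inferred from the TJS/TJNS polling in the PASM snippets below, readLongCmd stands in for the driver's actual request code, and the driver's own Spin API already wraps all of this):

CON
  MB_REQUEST = 0                        ' request + bank/external address (bit 31 set while pending)
  MB_DATA    = 4                        ' read/write data, or hub address for bursts
  MB_MASK    = 8                        ' mask, or transfer count

PUB readLongViaMailbox(mailbox, extAddr, readLongCmd) : r
  ' fill the other longs first, then write the request long last so that write triggers the request
  long[mailbox + MB_MASK] := 0                          ' mask 0 = plain read
  long[mailbox + MB_REQUEST] := readLongCmd << 24 | extAddr
  repeat while long[mailbox + MB_REQUEST] < 0           ' driver clears bit 31 when the request is serviced
  r := long[mailbox + MB_DATA]                          ' result arrives in the data long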
If I reverse this order, the request long is written last, which is what triggers the memory request, and a SETQ #2 followed by a WRLONG becomes a safe and fast way to generate memory requests even with the fifo running, as it might be in video driver clients. The existing problem is that any fifo use can interfere with the SETQ read/write bursts and introduce gaps between the longs transferred, potentially causing a request to be triggered prematurely with stale data parameters in some mailbox registers. My own video driver client works around this issue by writing the second two mailbox longs first (with a SETQ #1 burst), then writing the request mailbox long separately after that, but doing this slows down the initial request a little bit. Changing the order would improve that side of things. I also thought that it may let the polling sequence that reads the result and the status be tightened to something like the sample below, which would also allow us to use the flags from the last long of the read burst to detect the poll exit condition. However, there is still a problem...
                mov     addr, ##$ABCDE
                call    #readlong               ' read HyperRAM memory
                                                ' data reg now contains the result
                ...

' readlong:
'   input  addr - external address to read from (trashed afterwards)
'   output data - result

readlong
                mov     mask, #0                ' set mask = 0 to prevent read-modify-write cycle
                setbyte addr, #READLONG, #3     ' setup 32 bit read request using external address
                setq    #3-1                    ' writing 3 longs
                wrlong  mask, mailboxPtr        ' trigger mailbox request
                rep     #3, #0                  ' poll for result (15 clock cycle rep loop once aligned to hub)
                setq    #3-1                    ' reading 3 longs (or possibly just two if you had a second mailboxPtr)
                rdlong  mask, mailboxPtr wcz    ' read memory data and polling status
if_nc           ret     wcz                     ' need to check with evanh if you can really return from the end of a rep loop

mask            long    0
data            long    0
addr            long    0
mailboxPtr      long    MAILBOXADDR
The new problem is that if the final read burst of the data result+status is itself interrupted by a fifo transfer on the client COG between reading the data and the polling status, you might have stale data read into the data result long; you'd need to read it again after the REP loop completes if you ever use the fifo during the polling operation. So the change of order helps one side but hinders the other. We sort of want to keep the existing order on polling for the result to prevent this problem. We can't really fix both ends.
It would be a fairly simple change in the PASM driver to reorder the mailbox but the SPIN layer which abstracts this ordering needs to be changed in lots of places (still mostly straightforward). If it is going to happen at all I think it's worth doing now before the code is released because changing it later will affect all the PASM clients.
I'll need to mull this over, and I'm really on the fence about doing it now that it introduces new problems... any suggestions?
After thinking it through a bit more, I should just keep the original mailbox order as is. The result polling side should be the one that is optimized, given RDLONGs typically take longer than WRLONGs to execute. Also, the fifo use is a special case, and not all PASM clients will need that, so they can still use a 3 WRLONG burst to set up the mailbox whenever they don't use the fifo, even the way it works today. With any luck this request setup sequence won't add too many extra total clocks for the second WRLONG:

                SETQ    #2-1
                WRLONG  data, mailboxPtr2
                WRLONG  addr, mailboxPtr1
And the read result polling can still exit the mailbox polling loop with the data quickly this way without needing the flags read:
                POP     exitaddr                ' pop return address off the stack
                REP     #3, #0                  ' repeat until result is ready
                SETQ    #2-1                    ' read data and status
                RDLONG  addr, mailboxptr1
                TJNS    addr, exitaddr          ' returns to caller when request is serviced
@evanh In the earlier post the existing mailbox long order was shown but I was just contemplating changing it to enable the request setup/polling according to the sample code provided. However in the end I've decided against changing it as it still introduces a problem on the read side even though it can improve writing the request.
Had a question you might be able to answer with your knowledge of the egg beater timing. In this code sequence:

                SETQ    #2-1
                WRLONG  data, mailboxPtr2
                WRLONG  addr, mailboxPtr1
How long would the second WRLONG take if mailboxPtr1 = mailboxPtr2 - 4 ?
I know that WRLONGs take anywhere from 3-10 clocks, but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.
... but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.
It is a determinable number of sysclocks, but it depends on the modulo'd difference between the final address of the burst and the address of mailboxPtr1. If they both shift in unison, i.e. the delta doesn't change, then you're in luck.
EDIT: Basically, if you can arrange the addresses so that burstEnd % totalCogs == (mailboxPtr1 - 3) % totalCogs, then you should achieve the optimal WRLONG of 3 sysclocks.
EDIT2: I don't think I'd worry about it with a 4-cog prop2, but a 16-cog prop2 would definitely want to use this.
Here's pins P32-P47:
HyperRAM Burst Writes - Data pins registered, Clock pin unregistered =============================== HubStart HyperStart BYTES BLOCKS HR_DIV HR_WRITE HR_READ BASEPIN DRIVE CR0 00040000 003e8fa0 0000c350 2 1 a0cec350 e0cec350 32 7 ff1f ------------------------------------------------------------------------------------------ | COUNT OF BIT ERRORS | |------------------------------------------------------------------------------------------| | | Compensations | | XMUL | 0 1 2 3 4 5 6 7 8 9 | |--------|---------------------------------------------------------------------------------| ... 300 | 399828 400242 399618 399048 0 400687 400098 399863 400250 399445 301 | 400015 400279 399866 399853 0 399099 399788 400982 400451 399256 302 | 399434 400291 399672 398792 0 399643 399664 400503 400084 400247 303 | 400354 399646 400398 400141 0 399740 399815 400467 399417 399182 304 | 399461 399469 399676 400301 0 399913 399116 399890 400066 400520 305 | 400359 400132 400471 400050 0 400554 399894 400055 399169 400547 306 | 399762 399951 399633 400542 0 400285 400642 400205 400625 401250 307 | 400055 399581 400313 399948 0 400090 399640 400285 400224 399772 308 | 400256 400319 399840 400420 0 399549 399547 399891 400879 399912 309 | 399954 399921 400330 400375 0 400486 400530 399327 399628 399428 310 | 399901 400845 399819 400061 0 399190 400112 399622 399490 400139 311 | 399687 399826 400341 399130 0 400161 400120 399717 399916 401052 312 | 400340 400418 400500 400613 0 400549 400256 399733 399830 400725 313 | 400347 399734 400124 399633 0 399231 398971 399759 400249 400693 314 | 400438 400502 400440 401155 0 399682 399825 399901 400024 400078 315 | 400239 400530 400131 400150 0 399074 400356 399792 400817 399448 316 | 400064 400335 400136 399958 1 399491 400386 400571 399900 399381 317 | 399956 399834 399691 400752 0 400450 399327 400501 399379 400078 318 | 400764 400028 400853 399625 0 399809 399661 400141 400173 400122 319 | 399647 399684 400492 399107 0 400345 399829 399920 400280 400620 320 | 400424 400380 399751 400041 1 399806 398797 399988 399113 399122 321 | 399202 400540 399867 399741 0 399446 400328 400547 400617 399992 322 | 400371 400195 399429 399750 2 400792 400472 399964 401351 400252 323 | 399552 399951 399819 400441 6 399205 399767 400019 400029 399876 324 | 400361 400212 399866 399941 4 399704 399531 399438 400107 399676 325 | 399835 400288 400014 399367 3 399075 400266 400550 400008 400084 326 | 400245 400531 400382 398874 6 400437 400288 399909 400079 400152 327 | 399946 399914 400190 399055 10 399533 400312 399975 399794 400696 328 | 399140 399882 400408 400007 6 399792 399947 400855 399614 400075 329 | 399882 400371 400385 400435 4 400301 399222 399698 399763 399799 330 | 399931 399355 399452 399889 14 400283 399716 400041 400472 400237 331 | 399585 400112 399953 399702 21 400489 401103 400571 400231 399667 332 | 400083 400168 399533 399519 10 400544 399541 399465 400156 400220 333 | 399974 400515 400636 399414 13 399598 400180 400049 401114 400735 334 | 399940 399817 400841 400458 15 399570 399482 399209 400474 400020 335 | 399810 399431 400308 399488 7 399804 400123 400274 399853 399974 336 | 399783 400124 400004 400176 13 400083 400636 399685 400060 400301 337 | 399507 400403 399997 399625 14 399707 400848 400231 399387 400008 338 | 400101 399930 399762 400936 21 399936 399771 399824 399014 399699 339 | 400280 400049 400232 400341 20 399866 400165 400427 399687 399601 340 | 399951 400531 400162 400230 25 399685 400009 400025 400449 400531 341 | 400037 400082 400702 399615 19 
400452 400670 399349 399766 399527 342 | 400191 400072 400051 400870 29 399968 400621 399665 399743 399377 343 | 399789 400012 400052 399956 26 399891 399217 400415 399098 399953 344 | 400635 399717 401135 400376 20 400384 399762 400387 399505 399924 345 | 399612 400186 400029 399037 24 400270 399213 400027 399780 400447 346 | 400550 399779 399639 400051 29 399898 399847 399407 400757 399031 347 | 399737 399711 399205 399722 29 400370 400258 399018 399389 399813 348 | 400013 400877 399856 399580 45 400562 399977 399272 400207 399686 349 | 399812 400425 399803 399745 21 400153 399482 399631 400280 399451 350 | 400163 400589 399915 399966 24 400184 399733 400290 400222 400912 351 | 399848 399902 399840 400484 28 399763 400110 400364 399549 399838 352 | 400182 399575 400498 399645 25 399935 399838 399402 400166 399770 353 | 400054 400479 400252 400134 21 400690 399961 399616 399658 400595 354 | 400162 399984 400560 400471 19 400600 401076 398929 400207 400279 355 | 400577 398707 399237 400181 24 400375 400232 400253 399682 399938 356 | 398731 399867 400018 400084 25 399292 399691 399482 400246 399744 357 | 400078 399636 400086 400791 20 400339 400088 399749 400402 400418 358 | 400699 399788 400193 400225 23 400135 400491 400706 400188 400078 359 | 399398 399817 400424 400325 31 399718 399294 399304 400313 399704 360 | 399409 399850 399671 399921 28 400134 399673 400496 400267 400737 361 | 398865 399554 400301 399890 24 399916 399752 399994 399972 400108 362 | 399601 400303 399682 399814 22 399859 400015 398922 399823 399931 363 | 400067 400090 400847 399782 37 399344 400577 399640 398585 399716 364 | 399846 400326 398851 400801 24 400014 400009 400748 401041 400290 365 | 400462 399778 400534 400295 36 399943 399327 400096 400384 399299 366 | 400161 401006 400314 399868 35 400631 400370 399861 399681 399789 367 | 399709 399621 400039 399815 33 400473 400190 400186 399520 398993 368 | 400664 399604 399236 399702 36 399459 399348 400124 399654 400085 369 | 399342 400061 399701 399976 34 399938 399312 400334 400447 400015 370 | 399819 399342 400056 400370 45 399778 399878 400185 400576 400660 371 | 400186 400426 400614 399160 41 400135 400365 400389 399811 400073 372 | 400421 399204 399787 399818 37 400072 399848 400777 399876 400027 373 | 400724 399693 400222 400128 43 400107 399751 399887 400326 399241 374 | 400429 400006 400270 400365 61 399752 399501 399668 398990 399567 375 | 400511 400059 399784 399347 44 399465 400277 400197 399675 400267 376 | 399928 399690 399561 400344 66 399520 399767 399840 399935 399561 377 | 400154 400391 399919 400376 57 399961 400499 399662 400001 400209 378 | 399631 399854 400549 399116 43 400156 400716 400269 400221 399769 379 | 400214 399878 399331 399512 63 399705 399822 399973 399602 400432 380 | 399590 400473 400183 400582 64 400249 400223 399765 399518 399210 381 | 400203 399439 399739 400485 81 400501 400516 399490 401436 399570 382 | 399699 399416 400382 400017 68 400051 399909 400499 399433 400352 383 | 400341 400060 399743 400227 106 401119 400673 399867 399378 399582 384 | 399994 399935 400416 399606 102 399831 399746 400320 399737 399829 385 | 401206 400181 400021 400156 100 399874 400544 399984 399492 400325 386 | 400296 400528 398380 399497 85 398731 400734 399813 399834 399302 387 | 400041 400420 400184 399950 99 400911 399621 399749 399189 400541 388 | 399296 400353 401566 399394 100 400671 399946 400095 399864 399286 389 | 400509 400306 400623 400304 93 399768 399769 399776 400066 399605 390 | 399414 399095 400597 400164 122 400202 399900 399838 
400517 399879
HyperRAM Burst Writes - Data pins registered, Clock pin unregistered =============================== HubStart HyperStart BYTES BLOCKS HR_DIV HR_WRITE HR_READ BASEPIN DRIVE CR0 00040000 003e8fa0 0000c350 2 1 a0aec350 e0aec350 16 7 ff1f ------------------------------------------------------------------------------------------ | COUNT OF BIT ERRORS | |------------------------------------------------------------------------------------------| | | Compensations | | XMUL | 0 1 2 3 4 5 6 7 8 9 | |--------|---------------------------------------------------------------------------------| ... 300 | 400164 400231 400514 399283 0 399616 399587 400353 400166 399695 301 | 399787 399852 399431 399861 0 399805 399444 399408 400619 399899 302 | 400388 400369 399797 400235 0 400005 400164 400310 399301 399670 303 | 400480 400450 399820 399775 0 398937 399754 400200 399737 399726 304 | 399797 399714 399754 400073 0 400532 399401 399766 400080 399621 305 | 399700 400406 400087 400464 0 400128 400097 400137 400116 399700 306 | 400209 399659 400422 399979 0 400372 400134 399562 399871 400121 307 | 399120 399972 399446 399906 0 400188 399591 400713 401112 399020 308 | 400033 400220 399307 399842 0 400042 399580 400193 399127 399845 309 | 399805 399195 400170 399927 0 399861 399961 399975 399582 400356 310 | 400504 399405 400619 399220 0 399911 399733 399918 399654 399712 311 | 400002 400504 400158 399782 0 399562 399355 399623 400137 399702 312 | 399438 399138 399379 399917 0 399150 399919 399738 400711 400214 313 | 399848 399593 400585 400837 0 400267 400021 399421 399344 400392 314 | 400326 399667 400924 399881 0 399758 400122 399554 400535 399587 315 | 399965 400334 399966 399115 0 399830 399788 399586 400522 399312 316 | 400347 399184 399960 399972 0 400366 400105 400522 399487 399621 317 | 399479 399682 399013 399812 0 400235 400099 400195 399741 399593 318 | 400119 399439 400851 400648 0 399304 400588 399620 399252 400007 319 | 399705 400110 399795 400367 0 400528 399402 400356 399665 400687 320 | 399701 399647 399784 400040 0 399920 399856 399736 400386 399971 321 | 400203 400466 400239 399713 0 399548 399147 400729 399743 399746 322 | 399744 400146 399656 399905 0 399733 399734 399313 399671 400330 323 | 400310 400390 399855 400595 0 399577 400222 400270 400626 399870 324 | 400013 400314 400673 400609 0 399636 399982 400187 399760 400664 325 | 399863 399406 400195 399753 0 400570 399473 400099 400200 399550 326 | 399469 399747 399830 399771 0 398961 400214 399686 399640 400504 327 | 399995 399737 399317 400034 0 399723 400531 401067 400345 400141 328 | 400547 400406 399662 400115 0 400134 399390 399573 399990 400061 329 | 400119 400909 400655 399725 0 400806 399775 400127 400084 400368 330 | 400676 400591 399427 399910 0 400038 399105 400137 399955 401108 331 | 400114 399489 399982 400409 0 399888 400517 400181 400349 399332 332 | 399872 399986 400426 400554 0 399822 399831 400251 399801 400152 333 | 400156 399895 399995 399569 0 399604 399546 400445 399704 399629 334 | 399600 399692 400199 399713 0 400836 400064 399548 399573 399756 335 | 400746 400626 400286 399600 0 399255 400519 400348 400508 400717 336 | 400500 400619 400114 400400 0 399904 399933 399664 400032 399353 337 | 400230 400460 399905 400230 0 399969 399822 400443 400341 399888 338 | 399778 400571 400934 398848 0 400639 399493 400322 399427 400251 339 | 400219 400642 400405 399229 0 399678 399328 399705 399973 399215 340 | 400587 400664 400066 400088 0 400482 399728 399686 398257 400125 341 | 399669 399997 400166 399753 0 399952 400809 
399851 401140 400161 342 | 399479 400055 400057 400224 0 399391 400247 399852 400147 399393 343 | 400332 400296 399061 399588 0 400735 400247 399983 399426 399771 344 | 400971 399160 400228 399548 0 400299 399680 400058 399999 398927 345 | 399843 399957 400140 400432 0 399075 401056 400786 400382 400302 346 | 399556 400224 400030 399757 0 399321 400304 398836 399929 400304 347 | 399877 399505 400441 399928 0 400770 399872 399877 399503 399569 348 | 399528 399274 400631 399727 0 399863 399388 399824 399602 399342 349 | 399512 399652 399827 399459 0 399955 400548 400822 399877 400204 350 | 400043 400167 400510 400521 0 401237 399988 399812 400065 400177 351 | 400000 400336 399760 399436 0 399982 400245 399480 399900 399617 352 | 400307 400545 400192 400803 0 400303 400741 399891 399654 400129 353 | 400664 400548 400173 400168 0 400902 399663 400360 399751 399978 354 | 400251 400119 400266 401337 0 400059 400116 399927 399913 400075 355 | 400041 400254 399515 400330 0 400314 399535 399513 400136 399413 356 | 399701 400448 399896 399659 0 399636 399163 400483 399942 399513 357 | 400126 400363 400104 399911 0 400032 399907 399258 399047 400173 358 | 400111 399959 400161 400248 0 400860 399900 400131 400496 399784 359 | 400298 400848 400267 399859 0 399275 400026 400432 399770 399671 360 | 399060 400518 401037 400193 0 400154 400324 399253 399691 400114 361 | 399804 399467 400447 399211 0 399336 399936 399874 400793 400205 362 | 398814 399900 399797 399760 0 399894 399750 400178 399485 400730 363 | 399620 400668 400501 400608 0 400529 399916 399558 399939 400256 364 | 400254 400196 400324 399662 0 399322 399573 400009 399230 400063 365 | 400201 400283 399851 399155 0 399797 399904 399796 399693 399777 366 | 399788 399156 399922 400010 1 400055 400823 400486 399466 400023 367 | 399948 400225 399297 400409 0 400376 399745 400031 400186 399132 368 | 400148 400290 400056 399426 0 400648 399747 399729 400136 399847 369 | 400334 400358 399363 400593 0 399942 399841 399934 400647 399503 370 | 400014 400685 400116 400254 0 399645 400072 400135 401141 400253 371 | 399962 400511 399650 400713 0 400121 399629 399857 401298 399495 372 | 400508 400113 400392 400502 0 399292 400317 399521 399567 399921 373 | 399641 399707 399930 400214 0 400328 399954 399826 400236 399968 374 | 400413 399988 398800 399126 0 399785 399622 400343 399860 399916 375 | 400392 399688 400733 399645 0 400301 399910 399971 400225 400330 376 | 399736 400236 400331 400377 0 399488 400131 400808 400424 400157 377 | 400615 399729 400730 399859 0 399600 400330 400497 399933 400146 378 | 399429 399619 400808 400422 0 399750 399889 399095 399520 399881 379 | 400327 400473 400468 400060 0 400085 400078 399758 399535 399558 380 | 399730 400077 400826 399674 0 399994 399666 400240 400894 400192 381 | 399815 399990 399921 399897 0 399888 399636 399078 400152 400146 382 | 400189 399973 400735 400053 0 400553 399982 400245 399915 399513 383 | 399103 400144 399531 399770 0 399708 400397 399842 400191 399491 384 | 399672 400552 400508 399452 0 400013 400456 400142 399872 399212 385 | 400456 399607 399815 399605 0 400582 399413 400334 399716 400161 386 | 399797 399623 399883 399503 0 400139 400040 400020 400153 399679 387 | 400411 399358 400650 399006 1 399756 399776 400179 399353 400642 388 | 399252 400047 400647 399485 0 399735 400215 400278 400057 399905 389 | 399924 399599 400306 400203 0 400004 400859 399678 400458 399568 390 | 399437 399511 399352 399538 0 399904 399688 399528 400412 400610
Registering the HR clkpin for data reads definitely gives a higher usable clock speed. And setting the CR0 drive strength to 19 ohms does help a little too. I'm getting over 320 MT/s read speed now. On the down side, the 22 pF capacitor definitely drags it down. A dedicated board layout will help a lot.
EDIT: ie: The difference between a basepin of 16 and 32 is tiny and generally doesn't impact reliability.