I think the narrowing lower down in frequency is because they fail right at my breakpoints, which were designed for the RAM ranges. If I tune these specifically for flash they may work better and be wider again. Not sure.
Why it continues to work at higher frequencies I'm not sure. It might be my test itself: I only check 16 bytes, and these are the same simple counting pattern at the same address. I need to generate a better flash test pattern to read to be sure it is working right.
Did you ever try relying on Hybrid Bursts to ease the task of dealing with those page boundary crossings (and the lack, or absence, of RWDS and of valid data they introduce), so as to avoid leaving gaps in Hub RAM?
I'm only comparing to prior hyperRAM behaviour. The same narrowing at higher frequencies happened there. What didn't happen was the extra wide highest band.
EDIT: I'm questioning the validity of the "287-360 MHz ok" result. My interest in testing HyperFlash myself is near zero. I see it as a burden on the HR performance.
EDIT2: Bah! Those forum icons are too close in looks. I mistook both posts as Roger's.
Did you ever try relying on Hybrid Bursts to ease the task of dealing with those page boundary crossings (and the lack, or absence, of RWDS and of valid data they introduce), so as to avoid leaving gaps in Hub RAM?
No, as of right now I was just using linear bursts. Hmm, does hybrid burst mode fully solve it? I guess you need to discard those extra bytes read at the start before it reaches your desired bytes and starts linear bursting again. That may solve it, and I'd just need to increase the delay before reading from the streamer. I'll look into it; that may fix it at the expense of some additional latency on flash reading. Great tip, thanks Yanomami.
EDIT: Actually no, I think this hybrid thing works a little differently. You get your desired bytes at the end of the page, then a gap while the remaining bytes at the start of the page arrive, then the next page starts linearly after that. So it still requires two streamer commands spaced apart to work, I think, not just an initial delay at the start as I was hoping. I might still have to break my bursts into two transactions to fit the model...
I've been working on this flash linear burst page size alignment issue today. I think I've managed to squeeze it in, but I am still testing the idea. It is both helped and complicated by the fact that burst reads can get fragmented.
The way it works is that the first burst read from flash has to compute the bytes remaining in the first 16 byte page being read. If the full 16 bytes are being read (i.e. the address's least significant nibble is 0) then reads continue as normal; otherwise the first burst fragment size is set to the bytes left in that page, that fragment completes, and then any remaining read portions get fragmented as normal. The SPIN2 driver code can ensure that the burst size for flash banks is aligned to a multiple of 16 bytes, and any per COG limit should take this into account too. That way the remainder of any read burst will always resume on a page boundary and the gap problem should go away. For the highest performance the flash can be accessed on page aligned boundaries to avoid the slight extra overhead, but this scheme will allow it to read any number of bytes at any address without leaving gaps in hub RAM. I think this is important to achieve in the driver.
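Here's a minimal Spin2 sketch of that first-fragment calculation (FLASH_PAGE and firstFragment are hypothetical names for illustration, not the actual driver code):

CON
  FLASH_PAGE = 16                             ' HyperFlash linear burst page size

PRI firstFragment(addr, count) : size
  ' bytes remaining in the page that addr falls within, capped by the request size
  size := FLASH_PAGE - (addr & (FLASH_PAGE - 1))
  size := size <# count                       ' limit to the requested byte count

Every fragment after this first one then starts page aligned, so as long as the configured burst sizes stay multiples of 16 the later fragments never straddle a page either.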
The new code for this takes up about 7 longs and has now totally filled up my LUT RAM. I've had to start to shuffle things around to make room. I think I only have about 9 spare longs left in COG RAM now, and that's it. I might need to hunt for a few more longs soon, especially if I find any errors in the LUT RAM code.
I've not been able to include single reads of flash longs and words that cross page boundaries in this scheme, only the bursts. So ideally those should not be done when reading from flash, or if they are required then set up a read burst of 2 or 4 bytes instead. I could possibly put this extra page-crossing check in the SPIN driver, but it would of course also slightly impact single RAM reads during the validation phase. TBD.
Writes work a little differently, as only single words or bursts are writeable. Bursts have to remain within 256 words (512 bytes) and must not cross a 512 byte buffer boundary. Written words should already be word aligned, and I return an error if they fall on odd addresses.
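As a sketch of those write constraints (checkWriteBurst is a hypothetical helper, not part of the driver):

PRI checkWriteBurst(addr, count) : ok
  if (addr & 1) or (count & 1)
    return false                              ' odd address or odd byte count -> alignment error
  ok := ((addr & $1FF) + count) <= 512        ' burst must stay inside one 512 byte write buffer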
Request lists will be able to read from flash but writing to it within request lists using special commands like fills and graphics writes will be a problem and is prevented in the driver. Flash sector writing is a special case anyway and requires extra setup commands to enable it so normal word and burst writes can and should still be used for that purpose.
I added in the following final features and now I'm 100% full for both LUT and COG RAM. It's chock-a-block!
1) List cancellation. A client COG can instruct the driver to cleanly stop at the end of the current list item being processed, which may be useful for clean shutdowns. The client clears the top bit of the list request in the first mailbox long; the driver COG polls this bit at the end of each list item before advancing, and once it detects the bit is cleared it writes zero to this address and stops. The same mailbox long can also be used to monitor the progress of the request list, as it gets updated with the address of the list item currently being processed by the driver. (A client-side sketch follows this list.)
2) Prevention of writes to flash in lists when running extended requests such as graphics fills and copies. The single word writes and single flash write bursts can still be put in a request list however. Also HyperFlash can still be read from for graphics operations such as image copies, copies into HyperRAM, wavetable data etc, within request lists.
3) The flash burst read fix for crossing 16 byte page boundaries in HyperFlash has been added. There should now be no gaps for any read address / length as long as the configured burst sizes remain multiples of this page size.
4) Automatic long/word memory address alignment (P1 style addressing) for any atomic 16 or 32 bit HyperFlash word and long reads. This prevents crossing page boundaries too, which is good. HyperRAM can still be accessed at any byte address for reads/writes of bytes/words/longs (P2 style addressing), making it more versatile. Flash word writes to unaligned word addresses, or an odd number of bytes written, will also be detected and return an alignment error, because the HyperFlash needs 16 bit writes (or multiples thereof) sent each time. Individual byte or long write requests to HyperFlash are not supported and those commands will fail.
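For item 1, a client-side sketch of the cancellation handshake (hypothetical, based on the mailbox behaviour described above):

PUB cancelList(mbox)
  long[mbox] := long[mbox] & $7FFF_FFFF       ' clear the top bit of the list request
  repeat until long[mbox] == 0                ' driver zeroes the long once it has stopped

And for item 4, the alignment rule amounts to something like this (sketch only; alignFlashAddr is not a real driver call):

PRI alignFlashAddr(addr, size) : a
  ' size = 2 for word reads, 4 for long reads; since 2 and 4 both divide 16,
  ' an aligned element can never straddle a 16 byte flash page
  a := addr & !(size - 1)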
Now I just need to validate this and it's then feature complete. I can't fit anything else in! Any bugs introduced here that require further instructions to remedy are going to really challenge me.
@evanh, I retested HyperFlash again and printed out the delay values that work at each frequency (tested from 3-9). If the delay LSB = 1 it means unregistered data bus pins, and 0 = registered data bus pins. The clock pin was unregistered. Results are attached with the good delays that read back the expected pattern with sysclk/1 reads from HyperFlash; delays that failed are omitted from the set. There seems to be reasonable overlap.
I've edited out some repeating values to keep the post size down. In fact the middle frequency of each overlap region (at the cut points marked "...") would make a sensible place to change from one delay value to the next. That makes it around 58MHz, 107MHz, 155MHz, 215MHz, 265MHz, 310MHz. Seems reasonably balanced.
Flash good at 25 MHz - good delays are: 3 4
Flash good at 26 MHz - good delays are: 3 4
Flash good at 27 MHz - good delays are: 3 4
Flash good at 28 MHz - good delays are: 3 4
Flash good at 29 MHz - good delays are: 3 4
...
Flash good at 88 MHz - good delays are: 3 4
Flash good at 89 MHz - good delays are: 3 4
Flash good at 90 MHz - good delays are: 3 4
Flash good at 91 MHz - good delays are: 3 4
Flash good at 92 MHz - good delays are: 3 4
Flash good at 93 MHz - good delays are: 4
Flash good at 94 MHz - good delays are: 4
Flash good at 95 MHz - good delays are: 4 5
Flash good at 96 MHz - good delays are: 4 5
Flash good at 97 MHz - good delays are: 4 5
Flash good at 98 MHz - good delays are: 4 5
Flash good at 99 MHz - good delays are: 4 5
...
Flash good at 115 MHz - good delays are: 4 5
Flash good at 116 MHz - good delays are: 4 5
Flash good at 117 MHz - good delays are: 4 5
Flash good at 118 MHz - good delays are: 4 5
Flash good at 119 MHz - good delays are: 5
Flash good at 120 MHz - good delays are: 5
Flash good at 121 MHz - good delays are: 5
Flash good at 122 MHz - good delays are: 5
Flash good at 123 MHz - good delays are: 5
Flash good at 124 MHz - good delays are: 5
Flash good at 125 MHz - good delays are: 5 6
Flash good at 126 MHz - good delays are: 5 6
Flash good at 127 MHz - good delays are: 5 6
Flash good at 128 MHz - good delays are: 5 6
...
Flash good at 183 MHz - good delays are: 5 6
Flash good at 184 MHz - good delays are: 5 6
Flash good at 185 MHz - good delays are: 5 6
Flash good at 186 MHz - good delays are: 5 6
Flash good at 187 MHz - good delays are: 6
Flash good at 188 MHz - good delays are: 6
Flash good at 189 MHz - good delays are: 6
Flash good at 190 MHz - good delays are: 6
Flash good at 191 MHz - good delays are: 6
Flash good at 192 MHz - good delays are: 6 7
Flash good at 193 MHz - good delays are: 6 7
Flash good at 194 MHz - good delays are: 6 7
Flash good at 195 MHz - good delays are: 6 7
...
Flash good at 233 MHz - good delays are: 6 7
Flash good at 234 MHz - good delays are: 6 7
Flash good at 235 MHz - good delays are: 6 7
Flash good at 236 MHz - good delays are: 6 7
Flash good at 237 MHz - good delays are: 7
Flash good at 238 MHz - good delays are: 7
Flash good at 239 MHz - good delays are: 7
Flash good at 240 MHz - good delays are: 7
Flash good at 241 MHz - good delays are: 7
Flash good at 242 MHz - good delays are: 7
Flash good at 243 MHz - good delays are: 7
Flash good at 244 MHz - good delays are: 7
Flash good at 245 MHz - good delays are: 7
Flash good at 246 MHz - good delays are: 7
Flash good at 247 MHz - good delays are: 7
Flash good at 248 MHz - good delays are: 7
Flash good at 249 MHz - good delays are: 7
Flash good at 250 MHz - good delays are: 7 8
Flash good at 251 MHz - good delays are: 7 8
Flash good at 252 MHz - good delays are: 7 8
Flash good at 253 MHz - good delays are: 7 8
...
Flash good at 276 MHz - good delays are: 7 8
Flash good at 277 MHz - good delays are: 7 8
Flash good at 278 MHz - good delays are: 7 8
Flash good at 279 MHz - good delays are: 7 8
Flash good at 280 MHz - good delays are: 8
Flash good at 281 MHz - good delays are: 8
Flash good at 282 MHz - good delays are: 8
Flash good at 283 MHz - good delays are: 8
Flash good at 284 MHz - good delays are: 8
Flash good at 285 MHz - good delays are: 8
Flash good at 286 MHz - good delays are: 8 9
Flash good at 287 MHz - good delays are: 8 9
Flash good at 288 MHz - good delays are: 8 9
Flash good at 289 MHz - good delays are: 8 9
...
Flash good at 331 MHz - good delays are: 8 9
Flash good at 332 MHz - good delays are: 8 9
Flash good at 333 MHz - good delays are: 8 9
Flash good at 334 MHz - good delays are: 8 9
Flash good at 335 MHz - good delays are: 9
Flash good at 336 MHz - good delays are: 9
Flash good at 337 MHz - good delays are: 9
Flash good at 338 MHz - good delays are: 9
...
Flash good at 355 MHz - good delays are: 9
Flash good at 356 MHz - good delays are: 9
Flash good at 357 MHz - good delays are: 9
Flash good at 358 MHz - good delays are: 9
Flash good at 359 MHz - good delays are: 9
Flash good at 360 MHz - good delays are: 9
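If one wanted to turn those crossover points into code, a small table lookup would do it (sketch only; the cut frequencies are just the midpoints read off the data above, and flashDelayFor is a hypothetical name):

DAT
cuts    long    58, 107, 155, 215, 265, 310   ' MHz crossover points from above

PUB flashDelayFor(mhz) : d | i
  d := 3                                      ' chart value for the lowest band
  repeat i from 0 to 5
    if mhz >= long[@cuts][i]
      d++                                     ' step up one chart value per crossover passed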
Here it is graphically for both RAM and Flash, showing the overlapping ranges that worked from 25MHz to 360MHz for sysclk/1 reads. Interesting that the RAM differs and is the one with the narrower bands. I noticed in the data sheets that the default drive strength impedance is 27 ohms for the HyperFlash vs 34 ohms for HyperRAM. Maybe that makes a slight difference...?
I retested HyperFlash again and printed out the delay values that work at each frequency (tested from 3-9).
Nice tables, I wonder how those move with temperature and if there is a single value that can be applied over a practical temperature range, or if this needs temp sense and live adjust (which would be more of a pain).
I've seen newer RAM parts specify ROM pattern areas, which can at least assist with bus tuning - maybe these issues are more widespread?
... I noticed in the data sheets that the default drive strength impedance is 27 ohms for the HyperFlash vs 34 ohms for HyperRAM. Maybe that makes a slight difference...?
Damn good point, I sure hope so. The difference sure is dramatic imho.
Well, I just now set the HyperRAM CR0 regs to 27 ohm impedance, and the sysclk/1 profile looks the same as before - it still doesn't match the HyperFlash, which is a pity.
... I wonder how those move with temperature and if there is a single value that can be applied over a practical temperature range, or if this needs temp sense and live adjust ...
At room temp I think it is somewhat stable, but the temperature extremes do show variation; evanh did some earlier work on that. A ROM pattern would be nice. In theory the RAM could be scanned through at init time if someone wanted to try different delay values; the problem is you'll get one or two working read delay values, and when there are two, you don't know which one is best unless you actually scan over the frequency range at that temperature.
It's probably easiest to assume linear variation over temperature and change your delay accordingly. If you know the current chip/board temperature and it changes slowly, you could have a COG adapting the driver delay. For sysclk/2 it's probably less of an issue given how wide the bands are, but it still could have an impact at some point.
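An init-time scan of the kind mentioned above could look roughly like this (sketch; checkPattern is a stand-in for a burst read and compare of known test data at a given chart value):

PUB scanReadDelays() : first, last | d
  first := -1
  last := -1
  repeat d from 3 to 9
    if checkPattern(d)                        ' does this delay read the pattern back correctly?
      if first == -1
        first := d                            ' lowest working chart value
      last := d                               ' highest working chart value

When first and last differ you still can't tell which end of the band you are sitting in without also scanning over frequency (or temperature), which is exactly the problem described above.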
Writes don't show this timing problem at sysclk/2 thankfully as the clock is centered in the middle of the bit.
@"Dave Hein" No it needs to be a single value. This is how the delay parameter is used below in the read code. The "delay" register here is actually the number in the charts above divided by two, because the LSB is used elsewhere to gain a half step of delay by selecting between registered vs live I/O input (regdatabus) which introduces a small amount of extra delay and is ideal to be able to transition between bands. If we didn't have that there would be some frequencies that become unusable with HyperRAM (at sysclk/1 input rates).
Basically, if you set the delay too high and wait too long to start the streamer, you miss the first byte(s) coming back from the HyperRAM. If you set the delay too low, you don't wait long enough and the streamer will clock in $FF from the undriven data bus before the HyperRAM has had a chance to respond to the clock you are sending it. So there is a sweet spot. Unfortunately, as well as varying with the P2 clock rate, it also varies with temperature as @evanh found.
        wxpin   clockdiv, clkpin        'adjust transition delay to # clocks
        setxfrq xfreqr                  'setup streamer frequency
        wypin   clks, clkpin            'setup number of transfer clocks
        wrpin   regdatabus, datapins    'setup data bus inputs as registered or not
        waitx   delay                   'tuning delay for input data reading
        xinit   xrecv, #0               'start data transfer and then jump to setup code
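So decoding a chart value from the tables above into those two controls would look something like this (a sketch only; REG_MODE and LIVE_MODE are placeholders for the real wrpin mode words selecting registered vs unregistered inputs):

  delay      := chartvalue >> 1                               ' whole steps of waitx delay
  regdatabus := lookupz(chartvalue & 1 : REG_MODE, LIVE_MODE) ' LSB=1 -> unregistered (live) pins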
It also may vary from chip to chip, or from batch to batch, or maybe varies depending on which pin is used. Tweaked code based on the timing of a few chips is a little scary.
Dave,
It is hairy, but it's not that bad. Chip fabrication is very consistent. The biggest variability is temperature ... And board layout. A different board will give a different outcome unless they all have a spec to conform to. That's something I'm keen to have, except I don't have the knowledge or experience myself.
JMG or Mark T might have the knowledge and experience. Von is working on the revC Eval Board with this sort of thing in mind but I'm not sure what can be achieved with general expansion headers compared to a dedicated hyperRAM right next to the prop2.
... Or are you saying that 7 and 8 are the smallest delays that work, and in that case wouldn't it just be 7?
The potential number of choices for the compensation is dependent on the ratio to sysclock. If it's sysclock/1 then there can only be one value that works at any given frequency, if that. For sysclock/2 there is potential for two workable compensations at any given frequency. And on it goes: /3 having three compensations, /4 having four compensations ...
Here's a sysclock/4 example of read data. You can see the first shift of reliable operation begins just above 80 MHz. For a short band it only has three working compensations. If that was sysclock/1 there wouldn't be any working compensation value for a short band.

HyperRAM Burst Reads - Data pins registered, Clock pin unregistered
===================================================================
HubStart  HyperStart  BYTES     BLOCKS  HR_DIV  HR_WRITE  HR_READ   BASEPIN
00040000  003e8fa0    0000c350    2       4     a0aec350  e0aec350    16

                            COUNT OF BIT ERRORS
        |                            Compensations
  XMUL  |      0      1      2      3      4      5      6      7      8      9
 -------|-----------------------------------------------------------------------
    40  | 400023 400081 401312      0      0      0      0 400061 399815 400100
    41  | 400307 400415 399928      0      0      0      0 399737 399487 400237
    42  | 400527 399662 399867      0      0      0      0 400467 399733 400060
    43  | 399686 400652 399471      0      0      0      0 400053 400041 400412
    44  | 400291 400649 400038      0      0      0      0 399070 399576 399455
    45  | 400212 400809 400188      0      0      0      0 399695 401213 400150
    46  | 399641 401091 400116      0      0      0      0 399477 399733 399646
    47  | 399740 400680 400290      0      0      0      0 400156 400525 400920
    48  | 400102 400789 399562      0      0      0      0 400746 400328 400046
    49  | 399391 399745 399596      0      0      0      0 400018 400521 400932
    50  | 400946 399454 399659      0      0      0      0 399523 399699 399458
    51  | 399374 398973 399548      0      0      0      0 399806 400008 399629
    52  | 400292 399938 399978      0      0      0      0 399277 399654 399298
    53  | 400042 400099 400095      0      0      0      0 400397 400904 399952
    54  | 399643 399724 399595      0      0      0      0 400042 400427 399890
    55  | 400153 400387 399591      0      0      0      0 400424 399735 399875
    56  | 399506 399933 400718      0      0      0      0 400548 400159 399670
    57  | 400120 399743 399752      0      0      0      0 400407 398947 399952
    58  | 400346 399618 400062      0      0      0      0 400186 400203 400042
    59  | 399629 399723 399601      0      0      0      0 400367 399625 399624
    60  | 400121 400263 400029      0      0      0      0 399571 400437 399603
    61  | 399876 400245 399306      0      0      0      0 400278 399112 399539
    62  | 400016 400009 399779      0      0      0      0 400147 399502 400537
    63  | 399611 400069 399850      0      0      0      0 400179 400654 399892
    64  | 400000 399363 400086      0      0      0      0 399484 399600 399580
    65  | 399703 399292 400856      0      0      0      0 399786 399492 399230
    66  | 400288 400373 400230      0      0      0      0 399919 400077 400110
    67  | 400242 400051 399859      0      0      0      0 399444 399505 399874
    68  | 399336 400299 399394      0      0      0      0 399764 399843 400258
    69  | 400513 399607 399853      0      0      0      0 400136 400155 399906
    70  | 400595 399438 400087      0      0      0      0 400119 400377 399852
    71  | 399569 400539 400542      0      0      0      0 399747 400790 399652
    72  | 399458 400141 400604      0      0      0      0 400060 400233 399486
    73  | 399957 399655 400984      0      0      0      0 399562 399228 400213
    74  | 399416 400111 400107      0      0      0      0 400482 400227 400265
    75  | 399820 400226 400676      0      0      0      0 400743 400161 400090
    76  | 399350 400200 400042      0      0      0      0 400011 399225 399648
    77  | 399793 399469 399983      0      0      0      0 400332 399642 399400
    78  | 400501 400041 399781      0      0      0      0 400295 399942 399463
    79  | 399667 399873 400189      0      0      0      0 400788 400881 400089
    80  | 399421 400623 399557      0      0      0      0 399773 399882 400451
    81  | 400456 399747 399486      0      0      0      0 400036 399876 400137
    82  | 400202 399576 399626      0      0      0      0 400698 399948 399605
    83  | 399695 400360 400302      0      0      0      0 399985 400164 399314
    84  | 400350 400099 400146      0      0      0      0 399356 400300 399790
    85  | 399164 399717 399498      1      0      0      0 400638 399917 399698
    86  | 400064 399876 399851    222      0      0      0 399112 400982 400165
    87  | 400235 400488 400118   5064      0      0      0 395972 400084 400525
    88  | 399734 399611 399634  20201      0      0      0 380206 399663 400233
    89  | 399631 400773 399600  45516      0      0      0 353482 399633 400986
    90  | 400554 399662 399853  78316      0      0      0 321367 399807 400573
    91  | 400082 400030 399308 110141      0      0      0 290079 400599 400571
    92  | 399461 400595 400015 134671      0      0      0 264881 400459 400489
    93  | 400783 400068 399435 168418      0      0      0 232033 399866 399434
    94  | 400555 399715 400395 213113      0      0      0 187269 400744 399364
    95  | 399438 399519 400165 246660      0      0      0 152801 399940 399906
    96  | 399994 400052 400253 286356      0      0      0 113425 399338 399797
    97  | 399406 400241 399603 321194      0      0      0  78959 399690 399566
    98  | 399631 399216 400504 346789      0      0      0  53291 400428 400000
    99  | 400158 399839 399695 374394      0      0      0  26198 400477 399932
   100  | 399150 398852 400289 392128      0      0      0   8160 399466 399710
   101  | 399904 400220 399130 399048      0      0      0    791 400321 399331
   102  | 399560 399719 400580 400027      0      0      0      1 399627 399386
   103  | 400431 400280 400274 400109      0      0      0      0 400766 399381
   104  | 400611 399428 399700 400720      0      0      0      0 399371 399940
   105  | 400374 399995 399335 400072      0      0      0      0 400385 399843
   106  | 400104 400471 399233 400475      0      0      0      0 400297 400425
   107  | 400223 400293 399688 400143      0      0      0      0 399452 399838
   108  | 399992 399848 400191 400330      0      0      0      0 399518 399474
   109  | 400233 399813 399763 399963      0      0      0      0 399609 399535
   110  | 400184 399535 399542 399792      0      0      0      0 399448 400342
   111  | 400232 399774 400336 399341      0      0      0      0 400086 399904
   112  | 400506 399688 400508 399797      0      0      0      0 400732 399578
   113  | 399437 400492 400919 400138      0      0      0      0 400007 399357
   114  | 399754 400131 399766 400662      0      0      0      0 399490 400600
   115  | 399831 400710 400035 399515      0      0      0      0 400689 400828
   116  | 400599 399894 400006 400367      0      0      0      0 400079 400664
   117  | 399888 400324 400145 399164      0      0      0      0 400024 400000
   118  | 399091 398969 400320 400155      0      0      0      0 400743 400461
   119  | 399668 400133 399306 399754      0      0      0      0 399383 400323
   120  | 399252 400237 400555 400410      0      0      0      0 400724 400075
   ...
I have managed to scrounge a few more longs in the code by sharing registers in different places etc and think with some effort I might free up just enough to get Read-Modify-Write supported as one final feature of this driver.
The only issue is I need to change the single element (byte/word/long) read from the simple single mailbox write into 2 mailbox writes.
The single element read request format is currently this in HUB RAM:
mailbox + 0 : read request (byte/word/long) | external address
mailbox + 4 : don't care
mailbox + 8 : don't care
To support a read/modify/write request I would need to change it to this:
mailbox + 0 : read request (byte/word/long) | external address
mailbox + 4 : new data value
mailbox + 8 : mask
The completion of the read code path would be altered to examine the mailbox+8 long (mask) to see if it was zero or not. If it is zero it would complete the read as normal and the data would be returned in mailbox+4 in HUB. If the mask was non zero it would be applied to the just read value and the relevant bits in new data value would be updated according to the mask bits (either with SETQ/MUXQ or AND/OR etc) and written back to the address just read using the same element size.
Importantly, the original read value would still be returned in mailbox+4. This allows a read-update to be supported for semaphores etc. E.g. you try to set a bit and see if it was already a 1 or a 0 before you set it, indicating whether it is already in use. I would always run this read-update cycle as a back to back operation on the bus so no other COGs could affect the change. This feature would also be very handy for graphics updates of pixel data that is smaller than a byte, and it avoids multiple mailbox requests to do this and any associated polling delay between them.
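The merge itself is simple; in Spin2 terms the driver would effectively do this (a sketch of the semantics only, not the actual PASM, which would use SETQ/MUXQ as noted above):

PRI mergeRMW(old, newdata, mask) : merged
  ' keep the old bits where mask = 0, take newdata bits where mask = 1
  merged := (old & !mask) | (newdata & mask)

For the semaphore case: request mask = 1 and newdata = 1, then test bit 0 of the returned original value to see whether the lock was already held.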
The only downside to this approach is that normal reads, which used to be an easy matter of writing a single long to the first mailbox to trigger them, would now need to ensure that the mask mailbox entry is also cleared to zero in case it has been changed by any other request (such as a write) since the last read was done. So it typically adds an extra write by the client. This is mainly of concern to PASM clients, not so much SPIN2 clients, as I will have the SPIN2 API do it for you. E.g. just one new line gets added to the code below. The extra overhead is probably not a big deal given that the performance of single reads is already limited by much larger overheads, and you'll typically want to use burst reads anyway, but it is still annoying and I am still deciding whether this change is worth it. Until I code it up and make sure it fits in the freed space I guess it is moot. It will be really tight. But it could be good.
Any thoughts?
PUB readWord(addr) : r | m
  if MAX_INSTANCES == 1             ' optimization for single instance, everything mapped to single bus
    m := mailboxAddrCog[cogid()]    ' get mailbox base address for this COG
    if m == 0                       ' prevent hang if driver is not running
      return -1
  else                              ' multiple buses, need to lookup address to find mailbox for bus
    m := addrMap[addr>>24]
    if m +> MAX_INSTANCES-1         ' if address not mapped, exit
      return -1
    m := mailboxAddr[m] + cogid()*12
  long[m][2] := 0                   '<------------------ NEW LINE NEEDED TO AVOID READ-MODIFY-WRITE
  long[m] := R_READWORD + (addr & $fffffff)
  repeat until long[m] >= 0
  return long[m][1]
Given that each cog has its own mailbox, PASM code that never sets that value can just ignore it, or clear it only when it knows that it needs clearing.
(Also, speaking of overhead, I think there'd be tremendous value in a cut-down, low-overhead driver: only one mailbox, one RAM bank, etc., so you can do fast-ish small accesses, like one would need for XMM, emulators, etc.)
Given that each cog has its own mailbox, PASM code that never sets that value can just ignore it, or clear it only when it knows that it needs clearing.
Yes, that third mailbox long would remain at zero after single reads, so if you did multiple single reads in a row you could avoid clearing it each time after setting it up just once at the start. The PASM client will know what it is doing, so it can decide what to do as needed. That can be helpful. I just liked that a single long mailbox write triggered a read; it was so simple.
(Also, speaking of overhead, I think there'd be tremendous value in a cut-down, low-overhead driver: only one mailbox, one RAM bank, etc., so you can do fast-ish small accesses, like one would need for XMM, emulators, etc.)
Yeah, there are plenty of features that could be removed/hard coded to speed up the whole thing. I expect it can/should be done after the main code is complete, as it is simpler to remove features once you know what they need to do than to hand craft them in later; I've also mentioned this in the past. But the features I've included in the full version should be quite useful, especially for the combined HyperFlash + HyperRAM case with the P2 EVAL module, as well as for GUI and external memory graphics. We trade off a bit of performance for this versatility. For medium sized transfer bursts it won't make a huge difference, but for individual random access use we could certainly speed it up with a cut down variant.
I'm also thinking it would be cool to have some type of XMM model like we used to have on the P1, but somehow using the HyperRAM and/or HyperFlash with caching. But I don't know how it could work yet, and whether it could make use of Hub exec or not. Without caching the performance won't be good, but with caching enabled it might end up working out okay for running very large programs. Something for the future...
First, because of the ability to do masked writes down to the bit level: that is a very useful addition to byte/word/long writes. And second, the semaphore thing. Not sure where I would need it, but it seems quite useful too.

Enjoy!

Mike
The usual P1 XMM model (as implemented in GCC, and probably similar in Catalina, but IDK, ask @RossH) is mostly transparent to the running code - it just has to use special function calls for jumps and for any memory access that may be external. This is relatively easy to hack onto an existing compiler, but is very, very slow, because every instruction fetch goes through a bounds check to determine if it crosses into the next cache line, and every jump needs to be a function call. This approach would be even slower on P2, because hubexec could not be used (OR COULD IT? The hardware breakpoint could make it work!).
I myself have also used the XMM moniker for a model where the code is fully aware that it, and any external data it might want to use, is being sliced into 512 byte pages and moved to hub for processing. This is fast because I can arrange the code to minimize external memory access, and after being moved to Hub the code runs at the usual LMM speed, because the assembler forces the last instruction in any page to always be a "jmp #vm_nextpage" or an unconditional jump/return. This model is much harder to support in a compiler though.

Someone here in the forum got it running on a P1, I forget the name; one somehow needed to add some compile time switches and tweak the linker script?

Mike
A single mailbox, single bank cut down driver without any special features like lists, fills, multi-bank copies, graphics transfers, register access, burst/latency control etc could be sped up quite a bit. The polling loop could also come down to within 16 clocks once aligned with the egg-beater, and you can still get all 3 longs read in the one poll loop, saving any additional reads later. It fits nicely:
        rep     #3, #0
        setq    #2                              ' 2 clocks
        rdlong  request, mboxaddr               ' 11 clocks for reading 3 longs once aligned to hub
        tjs     request, #service_request       ' 2 clocks
Looking at the total code saved in the read path I'd roughly estimate a doubling of the request rate could just about be had for the cut down driver for single element transfers. So let's say ~2M/s instead of ~1M/s at about 250MHz or so.
As it is right now, the single element read code path is about 84 instructions long, plus the mailbox polling loop, which varies with the number of active COGs but is 40 cycles at best (for a single COG, for this comparison). As well as performing the HyperRAM transfer itself, this code path currently reads and sets up the different bank control pins, reads and applies per bank burst settings from LUT, extracts per COG mailbox settings and other state, applies per bank latencies and read delays, applies per COG burst settings, tests for list requests, and sets up round robin fairness for the next poll. A minimalist implementation would remove all of this and would then be in the vicinity of 44 instructions plus its tighter, shorter polling loop.
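As a rough sanity check on that doubling estimate (assuming 2 clocks per instruction and ignoring the external bus transfer itself): the current path is about 84 x 2 + 40 = 208 clocks, or roughly 1.2M requests/s at 250MHz, while the cut down path would be about 44 x 2 + 16 = 104 clocks, or roughly 2.4M/s, consistent with the ~2M/s vs ~1M/s figures above.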
For burst transfers this gain will be reduced as the size increases because some extra work gets done during the actual transfer time itself, but there will still be some gains there too.
I do think a cut down driver for tightly coupled applications could be useful to include as well as the fully featured one. It just would not be as useful for graphics or for multiple COGs sharing the common memory. You could also only use either the HyperFlash or the HyperRAM in your application, not both, unless perhaps two driver COGs were spawned and they were never active at the same time, carefully controlled by the application COG using it.