I think the narrowing low down in frequency is because they fail right at my breakpoints, which were designed for the RAM ranges. If I tune these breakpoints differently they may work better and be wider again. Not sure.
Now why it continues to work at higher frequencies I'm not sure. It might be my test itself. I only check 16 bytes and these are the same simple counting pattern at the same address. I need to generate a better flash test pattern to read to be sure it is working right.
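As a rough illustration only (the mixing formula and helper name here are made up, not the test code actually in use), an address-dependent pattern in Spin2 could look like this, so each byte depends on where it lives in flash rather than repeating the same 16 byte count:

PRI patternByte(flashAddr) : b
  ' mix the address so neighbouring pages and shifted windows no longer look identical
  b := (((flashAddr * 31) + (flashAddr >> 7)) ^ $A5) & $FF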
I'm only comparing to prior hyperRAM behaviour. The same narrowing at higher frequencies happened there. What didn't happen was the extra wide highest band.
EDIT: I'm questioning the validity of the "287-360 MHz ok" result. My interest in testing HyperFlash myself is near zero. I see it as a burden on the HR performance.
EDIT2: Bah! Those forum icons are too close in looks. I mistook both posts for Roger's.
Did you ever try relying on Hybrid Bursts to ease the task of dealing with those page boundary crossings (and the lack of RWDS, and hence of valid data, that they introduce), so as to avoid leaving gaps in Hub RAM?
No, as of right now I was just using linear bursts. Hmm, does hybrid burst mode fully solve it? I guess you need to discard those extra bytes read at the start before it reaches your desired bytes and then starts linear bursting again. That may solve it, and I would just need to increase the delay before reading from the streamer. I'll look into it; that may fix it at the expense of some additional latency on flash reads. Great tip, thanks Yanomami.
EDIT: Actually no, I think this hybrid thing works a little differently. You get your desired bytes at the end of the page, then a gap while the remaining bytes at the start of the page arrive, then the next page starts linearly after that. So it still requires two streamer commands spaced apart to work, I think, not just an initial delay at the start which is what I was hoping for. I think I might still have to break my bursts apart into two transactions to fit the model...
I've been working on this flash linear burst page size alignment issue today. I think I've managed to squeeze it in, but I am still testing the idea. It is both helped and complicated by the fact that burst reads can get fragmented.
The way it works is that the first burst read from flash has to compute the bytes remaining in the first 16 byte page being read. If the full 16 bytes are being read (i.e. the address's least significant nibble is 0) then reads continue as normal; otherwise the first burst fragment size is set to the bytes left in that page, that fragment proceeds to completion, and then any remaining read portions get fragmented as normal. The SPIN2 driver code can ensure that the burst size for flash banks is aligned to a multiple of 16 bytes, and any per-COG limit should take this into account too. That way the remainder of any read burst will always resume on another page boundary and the gap problem should go away. For the highest performance the flash can be accessed on page aligned boundaries to avoid the slight extra overhead, but this scheme will allow it to read any number of bytes at any address and not leave gaps in hub RAM. I think this is important to achieve in the driver.
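As a sketch only of the fragmentation rule just described (hypothetical helper name in Spin2, not the driver's actual PASM):

CON
  FLASH_PAGE = 16                                  ' HyperFlash linear burst page size in bytes

PRI firstFragment(addr, count) : frag
  ' bytes remaining in the first 16 byte page; a page-aligned start gives the full 16
  frag := FLASH_PAGE - (addr & (FLASH_PAGE - 1))
  frag := frag <# count                            ' never larger than the whole request

After this first fragment completes, the rest of the burst starts on a page boundary, so the normal fragmenting can carve it up without leaving gaps, as long as the fragment sizes stay multiples of 16.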
The new code for this takes up about 7 longs and has now totally filled up my LUT RAM. I've had to start to shuffle things around to make room. I think I only have about 9 spare longs left in COG RAM now, and that's it. I might need to hunt for a few more longs soon, especially if I find any errors in the LUT RAM code.
I've not been able to include single reads of flash longs and words that cross page boundaries in this scheme, only the bursts. So ideally that should not be done when reading from flash, or if they are required then set up a read burst of 2 or 4 bytes instead. I could possibly put this extra page crossing check in the SPIN driver, but it would of course also slightly impact any single RAM reads during the validation phase. TBD.
Writes work a little differently, as only single words or bursts are writeable. The bursts have to remain within 256 words (512 bytes) and not cross a 512 byte buffer boundary. Written words should already be word aligned and I return an error if they fall on odd addresses.
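A minimal sketch of that write check (hypothetical helper name, not the driver's actual validation code):

CON
  WRITE_BUF = 512                                      ' HyperFlash write buffer size in bytes

PRI flashWriteOk(addr, count) : ok
  if (addr & 1) or (count & 1) or (count == 0)         ' must start word aligned and be whole words
    return false
  ok := (addr / WRITE_BUF) == ((addr + count - 1) / WRITE_BUF)  ' must stay inside one 512 byte buffer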
Request lists will be able to read from flash, but writing to it within request lists using special commands like fills and graphics writes will be a problem and is prevented in the driver. Flash sector writing is a special case anyway and requires extra setup commands to enable it, so normal word and burst writes can and should still be used for that purpose.
I added in the following final features and now I'm 100% full for both LUT and COG RAM. It's chock-a-block!
1) List cancellation. A client COG can instruct the driver to cleanly stop at the end of the list item currently being processed, which may be useful for clean shutdowns. The client clears the top bit of the list request in the first mailbox long; the driver COG polls this at the end of each list item before advancing and, once it detects the bit is cleared, writes zero to that address and stops. The same mailbox long can also be used to monitor the progress of the request list, as it gets updated with the address of the list item currently being processed by the driver.
2) Prevention of writes to flash in lists when running extended requests such as graphics fills and copies. The single word writes and single flash write bursts can still be put in a request list however. Also HyperFlash can still be read from within request lists for graphics operations such as image copies, copies into HyperRAM, wavetable data etc.
3) The flash burst read fix for crossing 16 byte page boundaries in HyperFlash has been added. There should now be no gaps for any read address / length as long as the configured burst sizes remain multiples of this page size.
4) Automatic long/word memory address alignment (P1 style addressing) for any atomic 16 or 32 bit HyperFlash word and long reads. This prevents crossing page boundaries too, which is good (a small sketch of this alignment follows below this list). HyperRAM can still be accessed at any byte address for reads/writes of bytes/words/longs (P2 style addressing), making it more versatile. Flash word writes to unaligned word addresses, or writes of an odd number of bytes, will also be detected and return an unaligned error, because the HyperFlash needs 16 bit writes (or multiples of that) sent each time. Individual byte or long write requests to HyperFlash are not supported and those commands will fail.
Now I just need to validate this and it's then feature complete. I can't fit anything else in! Any bugs introduced here that require further instructions to remedy are going to really challenge me.
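Regarding point 4, the alignment itself is just a mask of the low address bits - here is a tiny Spin2 sketch (hypothetical helper name; the driver does this in PASM):

PRI alignFlashAddr(addr, elementSize) : aligned
  ' elementSize is 2 for word reads, 4 for long reads (P1 style addressing)
  aligned := addr & !(elementSize - 1)
  ' a 2 or 4 byte aligned element can never straddle a 16 byte flash page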
@evanh, I retested HyperFlash again and printed out the delay values that work at each frequency (tested from 3-9). If the delay LSB = 1 it means unregistered data bus pins, and 0 means registered data bus pins. The clock pin was unregistered. Results are attached, showing the delays that read back the expected pattern with sysclk/1 reads from HyperFlash; delays that failed are omitted from the set. There seems to be reasonable overlap.
I've edited out some repeating values to keep the post size down. The middle frequency of each cut region (...) where the delays overlap would make a sensible point to change from one delay value to the next. That works out to around 58MHz, 107MHz, 155MHz, 215MHz, 265MHz and 310MHz, which seems reasonably balanced.
Flash good at 25 MHz - good delays are: 3 4
Flash good at 26 MHz - good delays are: 3 4
Flash good at 27 MHz - good delays are: 3 4
Flash good at 28 MHz - good delays are: 3 4
Flash good at 29 MHz - good delays are: 3 4
...
Flash good at 88 MHz - good delays are: 3 4
Flash good at 89 MHz - good delays are: 3 4
Flash good at 90 MHz - good delays are: 3 4
Flash good at 91 MHz - good delays are: 3 4
Flash good at 92 MHz - good delays are: 3 4
Flash good at 93 MHz - good delays are: 4
Flash good at 94 MHz - good delays are: 4
Flash good at 95 MHz - good delays are: 4 5
Flash good at 96 MHz - good delays are: 4 5
Flash good at 97 MHz - good delays are: 4 5
Flash good at 98 MHz - good delays are: 4 5
Flash good at 99 MHz - good delays are: 4 5
...
Flash good at 115 MHz - good delays are: 4 5
Flash good at 116 MHz - good delays are: 4 5
Flash good at 117 MHz - good delays are: 4 5
Flash good at 118 MHz - good delays are: 4 5
Flash good at 119 MHz - good delays are: 5
Flash good at 120 MHz - good delays are: 5
Flash good at 121 MHz - good delays are: 5
Flash good at 122 MHz - good delays are: 5
Flash good at 123 MHz - good delays are: 5
Flash good at 124 MHz - good delays are: 5
Flash good at 125 MHz - good delays are: 5 6
Flash good at 126 MHz - good delays are: 5 6
Flash good at 127 MHz - good delays are: 5 6
Flash good at 128 MHz - good delays are: 5 6
...
Flash good at 183 MHz - good delays are: 5 6
Flash good at 184 MHz - good delays are: 5 6
Flash good at 185 MHz - good delays are: 5 6
Flash good at 186 MHz - good delays are: 5 6
Flash good at 187 MHz - good delays are: 6
Flash good at 188 MHz - good delays are: 6
Flash good at 189 MHz - good delays are: 6
Flash good at 190 MHz - good delays are: 6
Flash good at 191 MHz - good delays are: 6
Flash good at 192 MHz - good delays are: 6 7
Flash good at 193 MHz - good delays are: 6 7
Flash good at 194 MHz - good delays are: 6 7
Flash good at 195 MHz - good delays are: 6 7
...
Flash good at 233 MHz - good delays are: 6 7
Flash good at 234 MHz - good delays are: 6 7
Flash good at 235 MHz - good delays are: 6 7
Flash good at 236 MHz - good delays are: 6 7
Flash good at 237 MHz - good delays are: 7
Flash good at 238 MHz - good delays are: 7
Flash good at 239 MHz - good delays are: 7
Flash good at 240 MHz - good delays are: 7
Flash good at 241 MHz - good delays are: 7
Flash good at 242 MHz - good delays are: 7
Flash good at 243 MHz - good delays are: 7
Flash good at 244 MHz - good delays are: 7
Flash good at 245 MHz - good delays are: 7
Flash good at 246 MHz - good delays are: 7
Flash good at 247 MHz - good delays are: 7
Flash good at 248 MHz - good delays are: 7
Flash good at 249 MHz - good delays are: 7
Flash good at 250 MHz - good delays are: 7 8
Flash good at 251 MHz - good delays are: 7 8
Flash good at 252 MHz - good delays are: 7 8
Flash good at 253 MHz - good delays are: 7 8
...
Flash good at 276 MHz - good delays are: 7 8
Flash good at 277 MHz - good delays are: 7 8
Flash good at 278 MHz - good delays are: 7 8
Flash good at 279 MHz - good delays are: 7 8
Flash good at 280 MHz - good delays are: 8
Flash good at 281 MHz - good delays are: 8
Flash good at 282 MHz - good delays are: 8
Flash good at 283 MHz - good delays are: 8
Flash good at 284 MHz - good delays are: 8
Flash good at 285 MHz - good delays are: 8
Flash good at 286 MHz - good delays are: 8 9
Flash good at 287 MHz - good delays are: 8 9
Flash good at 288 MHz - good delays are: 8 9
Flash good at 289 MHz - good delays are: 8 9
...
Flash good at 331 MHz - good delays are: 8 9
Flash good at 332 MHz - good delays are: 8 9
Flash good at 333 MHz - good delays are: 8 9
Flash good at 334 MHz - good delays are: 8 9
Flash good at 335 MHz - good delays are: 9
Flash good at 336 MHz - good delays are: 9
Flash good at 337 MHz - good delays are: 9
Flash good at 338 MHz - good delays are: 9
...
Flash good at 355 MHz - good delays are: 9
Flash good at 356 MHz - good delays are: 9
Flash good at 357 MHz - good delays are: 9
Flash good at 358 MHz - good delays are: 9
Flash good at 359 MHz - good delays are: 9
Flash good at 360 MHz - good delays are: 9
Here it is graphically for both RAM and Flash, showing the overlapping ranges that worked from 25MHz to 360MHz for sysclk/1 reads. Interesting that the RAM is different and is the one that has narrower bands. I noticed in the data sheets that the default drive strength impedance is 27 ohms for the HyperFlash vs 34 ohms for HyperRAM. Maybe that makes a slight difference...?
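To put the crossover idea into code, a sketch like this (illustrative only, using the mid-overlap switch points quoted above, which will move with board and temperature) would pick the chart delay value from the P2 clock frequency:

PRI flashDelayForFreq(freqMHz) : d
  ' switch points taken from the middle of the overlap regions in the tables above
  if freqMHz < 58
    d := 3
  elseif freqMHz < 107
    d := 4
  elseif freqMHz < 155
    d := 5
  elseif freqMHz < 215
    d := 6
  elseif freqMHz < 265
    d := 7
  elseif freqMHz < 310
    d := 8
  else
    d := 9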
... I noticed in the data sheets that the default drive strength impedance is 27 ohms for the HyperFlash vs 34 ohms for HyperRAM. Maybe that makes a slight difference...?
Damn good point, I sure hope so. The difference sure is dramatic imho.
Well I just now set the HyperRAM CR0 regs to 27 ohms impedance, and the profile looks the same as it was for sysclk/1 though - it still doesn't match the HyperFlash which is a pity.
I retested HyperFlash again and printed out the delay values that work at each frequency (tested from 3-9).
Nice tables, I wonder how those move with temperature and if there is a single value that can be applied over a practical temperature range, or if this needs temp sense and live adjust (which would be more of a pain).
I've seen newer RAM parts specify ROM pattern areas, which can at least assist with bus tuning - maybe these issues are more widespread?
At room temp I think it is somewhat stable, but the temperature extremes do show variation, and evanh did some earlier work on that. A ROM pattern would be nice. In theory the RAM could be scanned through at init time if someone wanted to try different delay values; the problem is you'll get one or two working read delay values, and when it is two you don't know which one is best unless you actually scan over the frequency range at that temperature.
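As an illustration of that init-time scan (all the driver calls here are hypothetical names, and it only tells you which delays work, not which of two is best):

PUB scanReadDelays(testAddr) : goodMask | d, i, ok, pattern[4], readback[4]
  repeat i from 0 to 15
    byte[@pattern][i] := i                    ' known pattern; writes are not delay sensitive
  hyperWriteBurst(testAddr, @pattern, 16)     ' hypothetical write call
  repeat d from 3 to 9                        ' candidate delay values from the tables
    hyperSetReadDelay(d)                      ' hypothetical delay setter
    hyperReadBurst(testAddr, @readback, 16)   ' hypothetical read call
    ok := true
    repeat i from 0 to 3
      if pattern[i] <> readback[i]
        ok := false
    if ok
      goodMask |= 1 << d                      ' bit set for each delay that read back correctly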
It's probably easiest to assume linear variation over temperature and change your delay accordingly. If you know the current chip/board temperature and it changes slowly, you could have a COG adapting the driver delay. For sysclk/2 it's probably less of an issue given how wide the bands are there, but it could still have an impact at some point.
Writes don't show this timing problem at sysclk/2 thankfully as the clock is centered in the middle of the bit.
@"Dave Hein" No it needs to be a single value. This is how the delay parameter is used below in the read code. The "delay" register here is actually the number in the charts above divided by two, because the LSB is used elsewhere to gain a half step of delay by selecting between registered vs live I/O input (regdatabus) which introduces a small amount of extra delay and is ideal to be able to transition between bands. If we didn't have that there would be some frequencies that become unusable with HyperRAM (at sysclk/1 input rates).
Basically, if you set the delay too high and wait too long to start the streamer, you miss the first byte(s) coming back from the HyperRAM. If you set the delay too low then you don't wait long enough, and the streamer will clock in $FF from the undriven data bus before the HyperRAM has a chance to respond to the clock you are sending it. So there is a sweet spot. Unfortunately, as well as varying with the P2 clock rate, it also varies with temperature as @evanh found.
wxpin clockdiv, clkpin 'adjust transition delay to # clocks
setxfrq xfreqr 'setup streamer frequency
wypin clks, clkpin 'setup number of transfer clocks
wrpin regdatabus, datapins 'setup data bus inputs as registered or not
waitx delay 'tuning delay for input data reading
xinit xrecv, #0 'start data transfer and then jump to setup code
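In other words, turning a chart value from the tables above into the two settings used by that code could be sketched like this (hypothetical helper, not the driver's actual code):

PRI delaySettings(chartValue) : delay, registered
  delay      := chartValue >> 1         ' whole sysclk cycles for the WAITX
  registered := (chartValue & 1) == 0   ' LSB 0 = registered data bus inputs, 1 = unregistered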
It also may vary from chip to chip, or from batch to batch, or maybe varies depending on which pin is used. Tweaked code based on the timing of a few chips is a little scary.
Dave,
It is hairy, but it's not that bad. Chip fabrication is very consistent. The biggest variability is temperature ... and board layout. A different board will give a different outcome unless they all have a spec to conform to. That's something I'm keen to have, except I don't have the knowledge or experience myself.
JMG or Mark T might have the knowledge and experience. Von is working on the revC Eval Board with this sort of thing in mind but I'm not sure what can be achieved with general expansion headers compared to a dedicated hyperRAM right next to the prop2.
... Or are you saying that 7 and 8 are the smallest delays that work, and in that case wouldn't it just be 7?
The potential number of choices for the compensation is dependent on the ratio to sysclock. If it's sysclock/1 then there can only be one value that works at any given frequency, if that. For sysclock/2 there is potential for two workable compensations at any given frequency. And on it goes, with /3 having three compensations, /4 having four compensations ...
Here's a sysclock/4 example of read data. You can see the first shift of reliable operation begins just above 80 MHz. For a short band it only has three working compensations. If that was sysclock/1 there wouldn't be any working compensation value for a short band.
I have managed to scrounge a few more longs in the code by sharing registers in different places etc., and I think with some effort I might free up just enough to get Read-Modify-Write supported as one final feature of this driver.
The only issue is I need to change the single element (byte/word/long) read from the simple single mailbox write into 2 mailbox writes.
The single element read request format is currently this in HUB RAM:
mailbox + 0 : read request (byte/word/long) | external address
mailbox + 4 : don't care
mailbox + 8 : don't care
To support a read/modify/write request I would need to change it to this:
mailbox + 0 : read request (byte/word/long) | external address
mailbox + 4 : new data value
mailbox + 8 : mask
The completion of the read code path would be altered to examine the mailbox+8 long (mask) to see if it was zero or not. If it is zero, it would complete the read as normal and the data would be returned in mailbox+4 in HUB. If the mask was non-zero, it would be applied to the just-read value, and the relevant bits in the new data value would be updated according to the mask bits (either with SETQ/MUXQ or AND/OR etc) and written back to the address just read, using the same element size.
Importantly the original read value would still be returned in mailbox+4. This allows a read-update to be supported for semaphores etc. E.g. you try to set a bit and see if it was already a 1 or a 0 before you set it, indicating whether it is already in use. I would always run this read-update cycle as a back to back operation on the bus so no other COGs could affect the change. This feature would also be very handy for graphics updates of pixel data that is smaller than a byte, and it avoids multiple mailbox requests to do this and any associated polling delay between them.
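To make the masked update and the semaphore idea concrete, here is a Spin2 sketch (readModifyWriteWord is a hypothetical API name, and in the driver the merge would be done in PASM with SETQ/MUXQ rather than like this):

PRI maskedMerge(oldValue, newData, mask) : result
  ' keep the old bits where mask is 0, take the new bits where mask is 1
  result := (oldValue & !mask) | (newData & mask)

PUB tryLock(addr) : gotIt | old
  ' set bit 0 of a word used as a lock; the returned pre-update value tells us
  ' whether the lock was already held before our masked write set it
  old := readModifyWriteWord(addr, $0001, $0001)    ' hypothetical driver call
  gotIt := (old & 1) == 0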
The only downside to this approach is that normal reads which used to be an easy matter of writing a single long to the first mailbox to trigger them would now need to ensure that the mask mailbox entry is also cleared to zero in case it has been changed by any other request such as a write since the last read was done. So it typically adds an extra write by the client. This is mainly of concern to PASM clients not so much SPIN2 clients as I will have the SPIN2 API do it for you. Eg. just one new line gets added to the code below. The extra overhead time is probably not that big a deal given the performance of single reads is already limited by much larger overhead and you'll typically want to use burst reads anyway, but it is still annoying and I am still deciding whether this change is worth it. Until I code it up and make sure it fits in the freed space I guess it is moot. It will be really tight. But it could be good.
Any thoughts?
PUB readWord(addr) : r | m
  if MAX_INSTANCES == 1                   ' optimization for single instance, everything mapped to single bus
    m := mailboxAddrCog[cogid()]          ' get mailbox base address for this COG
    if m == 0                             ' prevent hang if driver is not running
      return -1
  else                                    ' multiple buses, need to lookup address to find mailbox for bus
    m := addrMap[addr>>24]
    if m +> MAX_INSTANCES-1               ' if address not mapped, exit
      return -1
    m := mailboxAddr[m] + cogid()*12
  long[m][2] := 0                         '<------------------ NEW LINE NEEDED TO AVOID READ-MODIFY-WRITE
  long[m] := R_READWORD + (addr & $fffffff)
  repeat until long[m] >= 0
  return long[m][1]
Given that each cog has its own mailbox, PASM code that never sets that value can just ignore it, or clear it only when it knows that it needs clearing
Yes, that third mailbox long would remain at zero after single reads, so if you did multiple single reads in a row you could avoid the clearing each time, after setting it up just once at the start. The PASM will know what it is doing, so it can make the decision on what to do as needed. That can be helpful. I just liked that single long mailbox write to trigger a read, it was so simple.
(Also, speaking of overhead, I think there'd be tremendous value in a cut-down, low-overhead driver. Only one mailbox, one RAM bank, etc., so you can do fast-ish small accesses (like one would need for XMM, emulators, etc.).)
Yeah, there are plenty of features that could be removed or hard coded to speed up the whole thing. I expect it can/should be done after the main code is complete, as it is simpler to remove things than to hand craft them in once you know what it needs to do, and I've mentioned this in the past too. But the features I've included in the full version should be quite useful, especially for the combined HyperFlash + HyperRAM case with the P2 EVAL module, as well as for GUI and external memory graphics. We trade off a bit of performance for this versatility. For medium sized transfer bursts it won't make a huge difference, but for individual random access use we could certainly speed it up with a cut down variant.
I'm also thinking it would be cool to have some type of XMM model like we used to have on the P1, but somehow using the HyperRAM and/or HyperFlash with caching. But I don't know how it could work yet, or whether it could make use of Hub exec or not. Without caching the performance won't be good, but with caching enabled it might end up working out okay for running very large programs. Something for the future...
First, because it can be used to do masked writes down to the bit level - that is a very useful addition to byte/word/long. And second the semaphore thing. Not sure where I would need it, but it seems to be quite useful too.
The usual P1 XMM model (as implemented in GCC, and probably similar in Catalina, but IDK, ask @RossH) is mostly transparent to the running code - it just has to use special function calls for jumps and for any memory access that may be external. This is relatively easy to hack onto an existing compiler, but is very very slow, because every instruction fetch goes through a bounds check to determine if it crosses into the next cache line and every jump needs to be a function call. This approach would be even slower on P2, because hubexec could not be used (OR COULD IT? The hardware breakpoint could make it work!).
I myself have also used the XMM moniker for a model where the code is fully aware that it and any external data it might want to use is being sliced into 512 byte pages and moved to hub for processing. This is fast because I can arrange the code to minimize external memory access and after being moved to Hub, code runs at usual LMM speed, because the assembler forces the last instruction in any page to always be a "jmp #vm_nextpage" or an unconditional jump/return. This model is much harder to support in a compiler though.
A single mailbox, single bank cut down driver without any special features like lists, fills, multi-bank copies, graphics transfers, register access, burst/latency control etc could be sped up quite a bit. The polling loop could also come down to within 16 clocks once aligned with the egg-beater, and you can still get all 3 longs read in the one poll loop, saving any additional reads later. It fits nicely:
rep #3, #0
setq #2 ' 2 clocks
rdlong request, mboxaddr ' 11 clocks for reading 3 longs once aligned to hub
tjs request, #service_request ' 2 clocks
Looking at the total code saved in the read path I'd roughly estimate a doubling of the request rate could just about be had for the cut down driver for single element transfers. So let's say ~2M/s instead of ~1M/s at about 250MHz or so.
As it is right now the single element read code path is about 84 instructions long plus the mailbox polling loop which varies with the number of active COGs but is 40 cycles at best for a single COG for this comparison. As well as the HyperRAM transfer itself this code path currently reads and sets up the different bank control pins, reads and applies per bank burst settings from LUT, extracts per COG mailbox settings and other state, applies per bank latencies and read delays, applies per COG burst settings, tests for list requests, and sets up round robin fairness for the next poll. A minimalist implementation would remove all of this and it would then be in the vicinity of 44 instructions plus its tighter shorter polling loop.
For burst transfers this gain will be reduced as the size increases because some extra work gets done during the actual transfer time itself, but there will still be some gains there too.
I do think a cut down driver for tightly coupled applications could be useful to include as well as the fully featured one. It just would not be as useful for graphics or for multiple COGs sharing the common memory. You could also only use either the HyperFlash or the HyperRAM in your application, not both, unless perhaps two driver COGs were spawned and they were never active at the same time, carefully controlled by the application COG using it.
I myself have also used the XMM moniker for a model where the code is fully aware that it and any external data it might want to use is being sliced into 512 byte pages and moved to hub for processing. This is fast because I can arrange the code to minimize external memory access and after being moved to Hub, code runs at usual LMM speed, because the assembler forces the last instruction in any page to always be a "jmp #vm_nextpage" or an unconditional jump/return. This model is much harder to support in a compiler though.
Someone here in the forum got it running on a P1 (I forget the name); you somehow needed to add some compile time switches and tweak the linker script?
Mike