I was able to compile my new HyperRAM driver codebase in PNut v34s running on VirtualBox. Still not tested, just compiling without errors.
The size difference vs Fastspin is interesting. Looks like the SPIN2 driver object is currently about 8kB including the 3600 byte PASM code. This probably compares to just over 13kB in Fastspin with optimisation enabled. Though the Fastspin version should still be somewhat faster to run of course. By how much, I'm keen to find out at some point.
I needed to change a few things before it compiled and this is what I learned (I'm sure it has been discussed before, but this is the first time I've ever run PNut so I'm learning the hard way when porting the driver code to be hopefully runnable using both environments):
- PNut needs a declared return variable to compile without errors if you want to return something; Fastspin doesn't need it.
PUB getHyperDriver() : r
return @hyper_driver
vs
PUB getHyperDriver() ' Fastspin allows this syntax and can still return a value
return @hyper_driver
- PNut needs cogid to be called as a function, cogid(), while Fastspin allows a bare cogid to be used.
- Fastspin allows # when referencing an object's constants, but PNut now always needs a dot. Eg:
driver#REQ_READBYTE ' Fastspin allows this
vs
driver.REQ_READBYTE
- There is no cognew function in PNut to spawn PASM COGs. You need to use coginit with 16 as the first argument to start a new COG.
driverCog := cognew(addr, @params)
vs
driverCog := coginit(16, addr, @params)
- SPIN2 method parameters can't use the same name as labels do in the PASM2 code in PNut.
- PNut requires any no-argument SPIN2 methods to be defined and called with ()
- Finally, there was a problem with the character order of the greater-than-or-equal operator.
PNut needs this:
repeat until long[m] >= 0
while (perhaps an older) Fastspin needed this to work correctly:
repeat until long[m] => 0
Hopefully a newer Fastspin should fix this.
Update: it looks like Fastspin 4.1.9 is doing what I want now and can use the PNut syntax. This should work according to the listing output:
00730 | ' repeat until long[m] >= 0
00730 | LR__0001
00730 81 CC 01 FB | rdlong dump_tmp001_, _dump_m
00734 00 CC 5D F2 | cmps dump_tmp001_, #0 wcz
00738 F4 FF 9F CD | if_b jmp #LR__0001
Do yourself a big favor and get the latest fastspin release
Yes, found all those.
BTW the latest fastspin uses >= and <= too.
pnut Spin2 leaves cog $000-$131 free and PR0-PR7 is usable too (? $1D8-$1DF)
fastspin leaves cog $000-$01F free only.
Finally had a bit of success today after a lot of stupid little problems I ran into with my new SPIN2 HyperRAM driver interface. It's been one of those days and I shouldn't be staying up late and waking up as early I guess. But in any case I have been able to read/write to the HyperRAM again using this whole new interface. A lot of new code needed to execute correctly for this to work, including a small change to the PASM driver to initialise things which I wasn't expecting to cause as many issues. Still needs a lot more testing but it shows life now.
Software like this is hard if you leave it alone for too long and need to relearn it each time you come back to it.
Frequency bands for HR Read Data using Eval Board and Hyper Accessory (data pins P16-P23, clock pin P24):
1-96 MHz, 112-193 MHz, 232-288 MHz: All registered pins, no capacitor.
1-87 MHz, 107-174 MHz, 217-266 MHz: All registered pins, 22 pF capacitor on clock.
1-92 MHz, 113-183 MHz, 226-279 MHz: -5 °C, All registered pins, 22 pF capacitor on clock.
1-82 MHz, 101-165 MHz, 208-253 MHz: 55 °C, All registered pins, 22 pF capacitor on clock.
1-94 MHz, 107-188 MHz, 221-277 MHz: Registered data pins, no capacitor.
1-84 MHz, 103-168 MHz, 208-256 MHz: Registered data pins, 22 pF capacitor on clock.
1-88 MHz, 108-177 MHz, 216-268 MHz: -5 °C, Registered data pins, 22 pF capacitor on clock.
1-79 MHz, 97-159 MHz, 199-243 MHz: 55 °C, Registered data pins, 22 pF capacitor on clock.
1-78 MHz, 87-156 MHz, 173-233 MHz, 264-308 MHz: All unregistered pins, no capacitor.
1-71 MHz, 83-140 MHz, 165-214 MHz, 249-286 MHz: All unregistered pins, 22 pF capacitor on clock.
1-75 MHz, 88-147 MHz, 175-225 MHz, 260-299 MHz: -5 °C, All unregistered pins, 22 pF capacitor on clock.
1-66 MHz, 78-132 MHz, 154-200 MHz, 238-272 MHz: 55 °C, All unregistered pins, 22 pF capacitor on clock.
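The band table above is easy to query in a small host-side script. The following is only a sketch in Python (not Spin2), using the rows that carry no temperature label; the dictionary keys and function names are made up here for illustration.

```python
# Read-data frequency bands in MHz, copied from the rows above that have
# no temperature label. The key names are hypothetical labels.
BANDS = {
    "all_registered_no_cap": [(1, 96), (112, 193), (232, 288)],
    "all_registered_22pf": [(1, 87), (107, 174), (217, 266)],
    "all_unregistered_no_cap": [(1, 78), (87, 156), (173, 233), (264, 308)],
    "all_unregistered_22pf": [(1, 71), (83, 140), (165, 214), (249, 286)],
}

def in_band(config, mhz):
    """True if mhz falls inside any measured working band for this config."""
    return any(lo <= mhz <= hi for lo, hi in BANDS[config])

def working_configs(mhz):
    """List every config whose measured bands include mhz."""
    return [name for name in BANDS if in_band(name, mhz)]

def gaps(config):
    """Frequency ranges (MHz) between consecutive working bands."""
    bands = BANDS[config]
    return [(a_hi + 1, b_lo - 1) for (_, a_hi), (b_lo, _) in zip(bands, bands[1:])]

if __name__ == "__main__":
    print(working_configs(297))           # which configs cover the 297 MHz HDTV rate
    print(gaps("all_registered_no_cap"))  # dead zones for that config
```

Checking 297 MHz this way shows which pin configuration still has a working band at that rate in these particular measurements.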
Looking at this again @evanh, I spent some more time on this today and added the ability to enable registered clock outputs for the Hyper bus transfers. It is just a flag passed at driver startup time and is not further modifiable at runtime at this point, although it could potentially be later to help dynamically switch between operating intervals. The registered data pin setting (used just for reads) can in theory be changed at run time, even per device on the bus in case there are slight path timing differences. I'm hoping that this will be sufficient for anyone wishing to deal with the temperature variation.
In my driver I have just setup some frequency intervals in the CON section and it can easily be altered if some suitable timing profile is known in advance. The table I am using is this (for sysclk/2 read transfer rates):
DELAY1 = 5 ' value used at/below FREQ1
FREQ1 = 90_000_000 ' in Hz
DELAY2 = 6 ' value used between FREQ1 and FREQ2
FREQ2 = 120_000_000
DELAY3 = 7
FREQ3 = 180_000_000
DELAY4 = 8
FREQ4 = 225_000_000
DELAY5 = 9
FREQ5 = 270_000_000
DELAY6 = 10 ' value used above FREQ5
The effect of these DELAYx values is to do this (the actual bits of DELAYx get split further):
P2 frequency <90MHz : use registered data pins, delay between WYPIX clock output and streamer XINIT read = 6 clocks
90-120MHz : use unregistered pins, delay = 7 clocks
120-180MHz : use registered data pins, delay = 7 clocks
180-225MHz : use unregistered pins, delay = 8 clocks
225-270MHz : use registered data pins, delay = 8 clocks
P2 frequency >270MHz : use unregistered data pins, delay = 9 clocks
Also if reading at sysclk/1 the above clock delays need to be reduced by 1 because the phase differs by one clock cycle with respect to the streamer.
For the current writes and the address phase for both reads/writes I always leave the data pins registered to keep the timing constant there. For future sysclk/1 writes I might need to review that choice.
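As a rough model of the CON table and the listed effects, here is a Python sketch (not the driver's Spin2 code). The function name is hypothetical, and pairing each DELAYx value with the effects in the order they are listed is my reading of the post. Only "at/below FREQ1" and "above FREQ5" are stated explicitly, so treating every breakpoint as inclusive on its lower side is an assumption.

```python
from bisect import bisect_left

# Breakpoints and delay values copied from the CON table (sysclk/2 reads).
FREQS = [90_000_000, 120_000_000, 180_000_000, 225_000_000, 270_000_000]
DELAYS = [5, 6, 7, 8, 9, 10]

# Effect of each DELAYx value as listed in the post:
# (data pin mode, clocks between clock output and streamer read).
EFFECTS = {
    5: ("registered", 6),
    6: ("unregistered", 7),
    7: ("registered", 7),
    8: ("unregistered", 8),
    9: ("registered", 8),
    10: ("unregistered", 9),
}

def delay_for(freq_hz):
    """Pick DELAYx for a P2 clock frequency in Hz.

    Assumption: a frequency exactly on a breakpoint uses the lower
    interval's value, matching 'at/below FREQ1'.
    """
    return DELAYS[bisect_left(FREQS, freq_hz)]

if __name__ == "__main__":
    for mhz in (50, 100, 150, 200, 250, 297):
        d = delay_for(mhz * 1_000_000)
        print(mhz, "MHz -> DELAY", d, EFFECTS[d])
```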
What worries me slightly is that boards without the suggested capacitor vs with the capacitor will have different optimal operational frequency ranges. There is no way the driver can fully know which to apply in advance which means any distributed applications/demos etc using the HyperRAM can perform quite differently in different systems depending on them having that capacitor fitted or not, even at the same temp and P2 clock speed. The addition of the capacitor for supporting future sysclk/1 writes reduces the operational range and interval overlaps slightly too, unfortunately around the 297MHz rate which is used for HDTV timing, which is not ideal. Perhaps with proper 3V rated v2 HyperRAM this may not be such an issue.
I'm thinking that using the capacitor should be reserved for special board layout with prop2 and a dedicated HR side by side. When those boards come available the read frequency bands will be entirely different anyway.
So there'll be sysclock/2 read/write reliable on all boards, and sysclock/2 writes reliable with sysclock/1 reads somewhat reliable on most boards. And fully sysclock/1 "mostly" reliable on the special boards. Mostly because there is still going to be read frequency bands, but hopefully broad enough to cater for general use without issue.
Yes, that is a reasonable way to consider it, evanh. It is pretty much impossible for this driver to cater for every case in advance automatically, but at least it will have some flexibility to be tweaked so people can attempt to tune it for their situation if they wish to operate in frequency bands that cause problems.
This information might be important for @"Peter Jakacki", particularly if he plans to fit a capacitor on his upcoming P2PAL HyperRAM. Perhaps it might be sufficient just to have the footprint for one on the PCB, and people could solder it on later if they want to experiment with sysclk/1 writes.
Made some decent progress on wrapping up the SPIN2/Fastspin API code for my HyperRAM driver in the last couple of days as I've been in the right frame of mind to get it done. Now I'm testing and documenting it.
I think it is quite a bit easier to use now as I have simplified some APIs and only do the bus creation internally within the driver once its first device is mapped. You can now use it as easily as one or two lines like this...
OBJ mem : "memorydriver"
' a minimalist setup...
PUB simpleStart()
' map and init HyperRAM at address 0-$ffffff, HyperFlash at $2000000-$3ffffff
' base module P2 pin number is 32
' all COGs round-robin serviced
' maximum burst limited only by the device's !CS limit or $ffff (whichever is lower)
mem.initHyperDriver(32, 0, $2000000, 0)
' read byte from address $aaaa of HyperFlash
mem.readByte($200aaaa)
' write long $abcdef12 to address $bcd0 of HyperRAM
mem.writeLong($bcd0, $abcdef12)
' a more complex setup and config...
PUB customStart() | bus
' map 16MB HyperRAM only to the $80000000-$80FFFFFF range,
' transfer burst is automatically limited to fit 4uS
bus := mem.mapHyperRam($80000000, S_16MB, 32, 32+12, 32+8, 32+10, 32+15, 0)
' setup all COGs to use round-robin polling with a 256 byte burst limit
mem.setupCogParams(ALLCOGS, bus, 256, 0)
' then make this COG the highest priority COG (priority 7) and don't yield during transfer requests
mem.setupCogParams(1<<cogid(), bus, -1, F_LOCKED + F_PRIORITY + 7)
' start the driver and also enable faster sysclk/1 reads
mem.start(bus, F_FASTREAD)
' start some video driver on this COG, pass it the mailbox address for this COG and HyperRAM address
startVideoCog(cogid(), getMailboxAddr(bus, cogid()), $80000000)
Here's the latest API I have now and it shouldn't need to change much now I hope, maybe some minor name tweaking. In each description using it, "r" represents the returned result/error.
There also may be some scope in the future to map other memory types such as SPI flash using a similar API so the software infrastructure could remain common. Eg. there could be a mapSpiFlash(flashStartAddr, size, miso, mosi, cspin, clkpin) API added etc which could map elsewhere into the common 4GB external memory address space. Some extra overhead in the outer Read/Write functions is required for enabling this probably using method pointers, but the software flexibility gains could be rather good allowing data to be sourced from different devices with the same API. TBD..
'P2-EVAL HyperRAM/HyperFlash simple init
PUB initHyperDriver(basePin, ramStartAddr, flashStartAddr, flags) : bus
PUB initHyperDriverCog(basePin, ramStartAddr, flashStartAddr, flags, cog) : bus
'init/config related
PUB mapHyperRam(ramStartAddr, size, datapin, cspin, clkpin, rwdspin, resetpin, burst) : bus
PUB mapHyperFlash(flashStartAddr, size, datapin, cspin, clkpin, rwdspin, resetpin, burst) : bus
PUB start(bus, flags) : driverCog
PUB startCog(bus, flags, cog) : driverCog
PUB setupCogParams(cogmask, bus, burst, priorityFlags) : cog
PUB removeCogs(cogmask, bus) : r
PUB shutdown(bus) : r
'helpers
PUB getMailboxAddr(bus, cog) : addr
PUB getDriverCogID(bus) : cog
PUB getMaxBurst(frequency, cs_interval, latency) : clocks
'reads
PUB readByte(srcAddr) : r
PUB readWord(srcAddr) : r
PUB readLong(srcAddr) : r
PUB read(dstHubAddr, srcAddr, count) : r
PUB readReg(addr, addrhi_16, addrlo_32) : r
'writes
PUB writeByte(dstAddr, data) : r
PUB writeWord(dstAddr, data) : r
PUB writeLong(dstAddr, data) : r
PUB write(srcHubAddr, dstAddr, count) : r
PUB writeReg(addr, addrhi_16, addrlo_32, value) : r
'complex transfers/request lists
'listPtr is an optional non-zero pointer to build a request list item for later processing instead of executing the single request immediately
PUB readBytes(dstHubAddr, srcAddr, count, listPtr) : r
PUB writeBytes(srcHubAddr, dstAddr, count, listPtr) : r
PUB fillBytes(dstAddr, pattern, count, listPtr) : r
PUB fillWords(dstAddr, pattern, count, listPtr) : r
PUB fillLongs(dstAddr, pattern, count, listPtr) : r
PUB gfxCopyImage(dstAddr, dstPitch, srcAddr, srcPitch, byteWidth, height, hubbuf, listPtr) : r
PUB gfxReadImage(dstHubAddr, dstPitch, srcAddr, srcPitch, byteWidth, height, listPtr) : r
PUB gfxWriteImage(srcHubAddr, srcPitch, dstAddr, dstPitch, byteWidth, height, listPtr) : r
PUB gfxFillBytes(dstAddr, dstPitch, width, height, pattern, listPtr) : r
PUB gfxFillWords(dstAddr, dstPitch, width, height, pattern, listPtr) : r
PUB gfxFillLongs(dstAddr, dstPitch, width, height, pattern, listPtr) : r
PUB copyBuf(dstAddr, srcAddr, totalBytes, hubBuffer, bufSize, listPtr) : r
PUB execList(bus, listPtr) : r
'advanced setup/config
PUB readIR(addr, ir_num, mcpdie_num) : r
PUB readCR(addr, cr_num, mcpdie_num) : r
PUB writeCR(addr, cr_num, mcpdie_num, value) : r
PUB setFlashLatency(addr, latency) : r ' future?
PUB setRamLatency(addr, latency) : r
PUB setBurst(addr, burst) : r
PUB setDelay(addr, delay) : r
PUB getBurst(addr) : r
PUB getDelay(addr) : r
I've updated the above API slightly to make its calls more consistent with respect to address order, and introduced the more generic read/write API for single calls for the typical read/write burst transfers, and made readBytes/writeBytes as the list capable forms.
I think I'll also introduce a non-blocking read/write option in the list. Possibly I could use the MSB of the listPtr as an optional flag that will indicate not to block, and the requesting client can then later poll or wait on its ATN for the result. Even in SPIN this will be useful, especially once longer lists are used, and you'll be able to do work while the data is being transferred in the background.
Why? HUB RAM addresses are only 19 bits. You can OR in the top bit. This is not being used as an external address, just a flag & HUB address in my driver function. We could otherwise add an additional parameter to every call involving list creation but it seems excessive to do it that way. The only reason not to would be if you know your hub addresses of your own managed lists are already using the upper bits for some weird reason and you don't want to have to clear it each time if you don't want to enable non-blocking operation.
Eg. what's wrong with having this:
fillBytes(addr, pattern, count, list)
execList(bus, list | NON_BLOCKING)
I wouldn't mind so much adding an additional flags argument to just execList for this, but it is the dozen other methods that would also need it, for when their listPtr is 0 and they want non-blocking operation enabled.
The HyperFLASH is responding! Only needed to flip the endianness to get it to work.
Reading its NVCR gives $8EBB which is what the data sheet says is the default. This proves both reads/writes are functional because you first need to write a special pattern before reading the register.
@rogloh
If some future spin of the silicon can accommodate the full 1MB of Hub RAM then you'll want that 20th bit. That said, if you are satisfied with limiting listPtr to only be able to point at the lower (existing) half of the Hub that's probably ok (but not pretty).
If the coloured chart you produced still accurately represents the fields within the mailbox then you have 4 don't care bits above the list pointer field. Could you give one of those this purpose?
AJL, I'm planning on setting bit 31 not bit 19 in this listPtr to the driver layer so 1MB HUB is still fine down the track. Also the 20 bit list pointer field sent to the HyperRAM driver will already have its top 8 bits overwritten with the special start list request pattern ($BF). This is why I can reuse these upper bits before they even get into the PASM driver's mailbox area. Eg:
' todo add non blocking
PUB execList(bus, listptr) : r | m
m := getMailboxAddr(bus, cogid())
if m < 0
return m
repeat until long[m] >= 0 'don't start another list if the last one hasn't ended
long[m] := (listptr & $fffff) | R_STARTLIST ' R_STARTLIST = $BF<<24
repeat until long[m] >= 0
r := (long[m] == 0) ? 0 : long[m][1]
Note: This Non-blocking thing is something handled in the SPIN layer where it won't wait in the repeat loops above, it is not done in the PASM, though you would want to enable ATN notifications for it to work correctly.
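The bit layout being discussed can be checked with a few lines of Python (a sketch only, not driver code). NON_BLOCKING at bit 31, the 20-bit hub address mask and R_STARTLIST = $BF<<24 come from the posts above; the function names are hypothetical.

```python
NON_BLOCKING = 1 << 31        # planned flag bit in listPtr (bit 31)
HUB_ADDR_MASK = 0xFFFFF       # 20-bit list pointer field, covers a 1MB hub
R_STARTLIST = 0xBF << 24      # start-list request pattern in the top byte

def split_list_ptr(list_ptr):
    """Separate the hub address from the non-blocking flag (SPIN-layer view)."""
    return list_ptr & HUB_ADDR_MASK, bool(list_ptr & NON_BLOCKING)

def mailbox_request(list_ptr):
    """Value written to the mailbox: the flag is masked off before PASM sees it."""
    return (list_ptr & HUB_ADDR_MASK) | R_STARTLIST

if __name__ == "__main__":
    ptr = 0x12345 | NON_BLOCKING
    print(split_list_ptr(ptr))        # hub address and flag
    print(hex(mailbox_request(ptr)))  # same request with or without the flag
```

Because the top 8 bits are overwritten by the request pattern, the flag never reaches the mailbox, which is why bit 31 is free to reuse at the API level.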
Writing to a HyperFlash sector and the individual words within it is working now.
I found we can also write using a single 512 byte write burst operation after sending the special 3 word unlock sequence. This will improve application performance, avoiding many extra mailbox transactions doing it word at a time etc. With video running we should still be able to get 1-4 HyperFlash mailbox transactions done per scan line, so writes can happen fast enough, certainly faster than the 0.5-2ms write time per 512 bytes, which is limited by the device itself.
Any sector erase time is still not ideal though. It's 2.9 seconds per 256kB sector erased. I still can't get my head around that delay. Writing 1MB will likely take ~13 seconds if you also have to erase the sector first but probably just 1 second if already erased, and it's up to 4 minutes to erase the whole 32MB chip!
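The 1MB estimate can be reproduced with simple arithmetic. This Python sketch uses only the figures quoted above (256kB sectors at 2.9 s per erase, 512 byte write bursts at 0.5-2 ms each); the function names are made up, and mailbox and transfer overheads are ignored.

```python
SECTOR_BYTES = 256 * 1024
SECTOR_ERASE_S = 2.9            # seconds per sector erase
PAGE_BYTES = 512
PAGE_WRITE_S = (0.0005, 0.002)  # min/max seconds per 512 byte write burst

def ceil_div(a, b):
    return -(-a // b)

def erase_time(nbytes):
    """Seconds to erase every sector touched by nbytes (sector aligned)."""
    return ceil_div(nbytes, SECTOR_BYTES) * SECTOR_ERASE_S

def write_time(nbytes):
    """(best, worst) seconds to program nbytes in 512 byte bursts."""
    pages = ceil_div(nbytes, PAGE_BYTES)
    return pages * PAGE_WRITE_S[0], pages * PAGE_WRITE_S[1]

if __name__ == "__main__":
    one_mb = 1024 * 1024
    best, worst = write_time(one_mb)
    print("erase 1MB:", erase_time(one_mb), "s")
    print("write 1MB:", best, "to", worst, "s")
    print("erase + best-case write:", erase_time(one_mb) + best, "s")
```

Four sector erases plus a best-case write lands close to the ~13 second figure, and the write alone is about 1 second at the fast end of the range.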
This is the first time I've been able to try anything like this until now, and having the proper SPIN2 API integrated makes all the difference for faster experimental testing.
Cluso99, I'm not entirely sure what you are asking about bit31 & bit30 being free if it relates directly to this above SPIN2 example.
However, in my mailbox scheme I already make extensive use of bit31 to test whether the mailbox request is active, as well as with TJS etc, and the lower bits including bit30 already contain other data used to indicate the request type (in fact bit30 = read/write).
setq #24-1 'read 24 longs
rdlong req0, mbox 'get all mailbox requests and data longs
polling_code skipf pattern ']dynamic polling code starts from here....
jatn atn_handler ']JATN (or JINT?) triggers reconfiguration
tjs req0, cog0_handler ']
tjs req1, cog1_handler ']Initially this is just a dummy placeholder
tjs req2, cog2_handler ']loop taking up the most space if there is
tjs req3, cog3_handler ']a polling loop with all round robin COGs.
...
I've found that when extracting the C/Z pair in one go with RCZL/RCZR you need to rotate twice or copy/restore the original value, as there is no "NR" anymore, so reading them independently as needed is just as fast and avoids that double rotation step.
Actually within the driver I just use the entire 8 bit upper value of the first mailbox long as a table jump index anyway which includes the bank bits / memory type so extracting two flags there are not that important. This method gives me instant branching to where I want it, by both service and memory type (flash or RAM), as well as my control path with special bank 15. It's fast and avoids multiple branches but the jump table does burn 128 COG LONGs though. I can save part of this space in the future if I add 2-3 more instructions per request if I get desperate.
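To picture the table-jump idea, here is a speculative Python sketch. Only bit31 = request active and R_STARTLIST = $BF<<24 are taken from the posts. Two things are guesses: that the 128 entries come from the remaining 7 bits of the top byte (bit 31 being always set on an active request), and that the bank occupies the low 4 bits of the top byte (suggested by "special bank 15" and the $BF pattern). The function names are hypothetical.

```python
R_STARTLIST = 0xBF << 24   # from the execList example

def is_active(req):
    """bit31 set means a request is pending (what TJS tests for)."""
    return bool(req & 0x8000_0000)

def jump_index(req):
    """ASSUMPTION: bits 30..24 select one of 128 jump table entries."""
    return (req >> 24) & 0x7F

def bank(req):
    """ASSUMPTION: the bank number sits in bits 27..24 of the request."""
    return (req >> 24) & 0xF

if __name__ == "__main__":
    req = R_STARTLIST | 0x12345
    print(is_active(req), jump_index(req), bank(req))
```

Under these assumptions the start-list request decodes to bank 15, which would match the control path mentioned above, but the real field layout is defined by the driver.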
@rogloh
Nothing in particular. I remembered there were instructions to extract the c&z flags. Shame we didn't think about making another instruction that did not rotate.
Speaking of COG space usage, the PASM driver is getting full again now. I added a couple more niceties to prevent potential problems happening in lists trashing the mailbox area and it is now consuming 485 COGRAM longs without any of my state dump debug code included, though I can always push up to 502 longs as I don't use interrupts at this stage. It is also consuming 508 LUT RAM longs without any debug code.
If I hunt for some more instruction optimizations I'm sure I can shrink it down a little here and there, but I haven't got desperate enough for that yet because it is sort of feature complete now. There's no room left for arbitrary angle pixel plotting at this stage. That potential idea I had might have to be jettisoned for now unless that 128 COG RAM long jump table implementation ever gets ditched. Adding this would still speed up non-horizontal and non-vertical line drawing in 8/16/32bpp modes though...so I still like the idea of it.
Same as Brian's results. Although the capacitor degrades the bands a little more. And higher temperature will degrade them further. Running the sources above ...
Frequency bands for HR Read Data, room temperature, data pins P16-P23, clock pin P24:
1-96 MHz, 112-193 MHz, 232-288 MHz: All registered pins, no capacitor.
1-87 MHz, 107-174 MHz, 217-266 MHz: All registered pins, 22 pF capacitor on clock.
1-94 MHz, 107-188 MHz, 221-277 MHz: Registered data pins, no capacitor.
1-84 MHz, 103-168 MHz, 208-256 MHz: Registered data pins, 22 pF capacitor on clock.
1-78 MHz, 87-156 MHz, 173-233 MHz, 264-308 MHz: All unregistered pins, no capacitor.
1-71 MHz, 83-140 MHz, 165-214 MHz, 249-286 MHz: All unregistered pins, 22 pF capacitor on clock.
EDIT: Oops, corrected a bug with HRclock pin registering.
@evanh. I was able to replicate this type of HyperRAM read test using my driver by iterating through the different P2 frequencies from 25MHz to 310 MHz, although I think 45-50MHz is probably about the practical lower operating limit if the 4uS CS time is to be honoured and you want to be able to transfer more than a long at a time, given the overheads and address phase etc.
The code and module appears to work together across the frequency bands if you set the delay appropriately at the transition points. This is what I used to change the delay as I varied the P2 frequency (freq in MHz). The LSB of the delay actually controls registered/unregistered data pin selection which adds a little more delay (it's like a half step of the true delay).
delay := (fast) ? 9 : 11 ' fast <> 0 for sysclk/1 reads
if freq < 270
delay--
if freq < 225
delay--
if freq < 180
delay--
if freq < 120
delay--
if freq < 88
delay--
mem.setDelay(RAM, delay) ' where RAM is the base address of the HyperRAM bank to adjust
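For checking the breakpoints offline, the same stepping logic can be ported to Python (a sketch, not the test program itself; the function name is hypothetical). The thresholds and starting values are copied from the Spin2 snippet above.

```python
def read_delay(freq_mhz, fast=False):
    """Delay value for a given P2 frequency in MHz.

    Starts at 9 for sysclk/1 reads (fast) or 11 for sysclk/2 reads and
    drops by one for every breakpoint the frequency is below.
    """
    delay = 9 if fast else 11
    for threshold in (270, 225, 180, 120, 88):
        if freq_mhz < threshold:
            delay -= 1
    return delay

if __name__ == "__main__":
    for mhz in (25, 87, 88, 120, 180, 225, 270, 304):
        print(f"Freq={mhz} MHz, delay={read_delay(mhz)}")
```

For sysclk/2 reads this gives delay=6 up to 87 MHz and steps to 7, 8, 9, 10 and 11 at 88, 120, 180, 225 and 270 MHz.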
I tested out the HyperRAM module in pin positions 0-15, 16-31, and 32-47 and all worked. Operating the module at base pins 0 or 16 seemed to top out at around 308MHz and 304MHz for final successful sysclk/2 and sysclk/1 read rates respectively at 20C room temperature. Running at base pin 32 with the P2-EVAL is slightly slower hitting around 304-305MHz P2 limit for both rates. It's still thankfully a little over 297MHz which is a sweet spot for 1080p.
Test output is attached showing the delay value changes and read test results for HyperRAM. I write a different 256 byte pattern at different addresses once at the start at 200MHz with sysclk/2, then read back each pattern at different frequencies and compare byte by byte. It's not an intensive memory test, just there to test my own timing delay values which, when incorrect, quickly show up as a skew offset by one or more bytes. Only the first 16 bytes of what was sent and received back are dumped for brevity.
HyperRAM driver init, result bus = 0
HyperRAM cog id = 1
HyperRAM mailbox addr = 000053C4
Freq=25 MHz, delay=6: read values compared ok
Freq=26 MHz, delay=6: read values compared ok
Freq=27 MHz, delay=6: read values compared ok
Freq=28 MHz, delay=6: read values compared ok
Freq=29 MHz, delay=6: read values compared ok
... <snip>
Freq=85 MHz, delay=6: read values compared ok
Freq=86 MHz, delay=6: read values compared ok
Freq=87 MHz, delay=6: read values compared ok
Freq=88 MHz, delay=7: read values compared ok
Freq=89 MHz, delay=7: read values compared ok
Freq=90 MHz, delay=7: read values compared ok
Freq=91 MHz, delay=7: read values compared ok
... <snip>
Freq=117 MHz, delay=7: read values compared ok
Freq=118 MHz, delay=7: read values compared ok
Freq=119 MHz, delay=7: read values compared ok
Freq=120 MHz, delay=8: read values compared ok
Freq=121 MHz, delay=8: read values compared ok
Freq=122 MHz, delay=8: read values compared ok
Freq=123 MHz, delay=8: read values compared ok
Freq=124 MHz, delay=8: read values compared ok
... <snip>
Freq=176 MHz, delay=8: read values compared ok
Freq=177 MHz, delay=8: read values compared ok
Freq=178 MHz, delay=8: read values compared ok
Freq=179 MHz, delay=8: read values compared ok
Freq=180 MHz, delay=9: read values compared ok
Freq=181 MHz, delay=9: read values compared ok
Freq=182 MHz, delay=9: read values compared ok
Freq=183 MHz, delay=9: read values compared ok
Freq=184 MHz, delay=9: read values compared ok
... <snip>
Freq=221 MHz, delay=9: read values compared ok
Freq=222 MHz, delay=9: read values compared ok
Freq=223 MHz, delay=9: read values compared ok
Freq=224 MHz, delay=9: read values compared ok
Freq=225 MHz, delay=10: read values compared ok
Freq=226 MHz, delay=10: read values compared ok
Freq=227 MHz, delay=10: read values compared ok
Freq=228 MHz, delay=10: read values compared ok
Freq=229 MHz, delay=10: read values compared ok
Freq=230 MHz, delay=10: read values compared ok
... <snip>
Freq=266 MHz, delay=10: read values compared ok
Freq=267 MHz, delay=10: read values compared ok
Freq=268 MHz, delay=10: read values compared ok
Freq=269 MHz, delay=10: read values compared ok
Freq=270 MHz, delay=11: read values compared ok
Freq=271 MHz, delay=11: read values compared ok
Freq=272 MHz, delay=11: read values compared ok
... <snip>
Freq=302 MHz, delay=11: read values compared ok
Freq=303 MHz, delay=11: read values compared ok
Freq=304 MHz, delay=11: read values compared ok
Freq=305 MHz, delay=11: first mismatch at offset 80
00000000 00003F30 : 31 62 93 C4 F5 26 57 88 B9 EA 1B 4C 7D AE DF 10
00000000 00007498 : 31 62 93 C4 F5 26 57 88 B9 EA 1B 4C 7D AE DF 10
Freq=306 MHz, delay=11: first mismatch at offset 104
00000000 00003F30 : 32 64 96 C8 FA 2C 5E 90 C2 F4 26 58 8A BC EE 20
00000000 00007498 : 32 64 96 C8 FA 2C 5E 90 C2 F4 26 58 8A BC EE 20
Freq=307 MHz, delay=11: first mismatch at offset 30
00000000 00003F30 : 33 66 99 CC FF 32 65 98 CB FE 31 64 97 CA FD 30
00000000 00007498 : 33 66 99 CC FF 32 65 98 CB FE 31 64 97 CA FD 30
Freq=308 MHz, delay=11: first mismatch at offset 20
00000000 00003F30 : 34 68 9C D0 04 38 6C A0 D4 08 3C 70 A4 D8 0C 40
00000000 00007498 : 34 68 9C D0 04 38 6C A0 D4 08 3C 70 A4 D8 0C 40
Freq=309 MHz, delay=11: first mismatch at offset 4
00000000 00003F30 : 35 6A 9F D4 09 3E 73 A8 DD 12 47 7C B1 E6 1B 50
00000000 00007498 : 35 6A 9F D4 1D 3E 73 A8 DD 12 47 7C B1 E6 3B 50
Freq=310 MHz, delay=11: first mismatch at offset 0
00000000 00003F30 : 36 6C A2 D8 0E 44 7A B0 E6 1C 52 88 BE F4 2A 60
00000000 00007498 : 3C 00 00 00 00 00 00 00 00 00 3C 00 00 00 00 00
I also tested out the HyperFlash but only seem to get it reading okay from 95-278MHz (sysclk/2) or 191-278MHz (sysclk/1) for some reason. Could be differences in output timing compared to HyperRAM - if so, I am very glad I made my delay a per bank parameter, not global per driver. Still checking.
One thing users will need to know is that reading a burst from flash can introduce gaps in the data when it crosses certain page boundaries. The streamer cannot compensate for this because it does not interpret RWDS as the data byte strobe, so the gaps end up in hub memory. These gaps can be reduced or eliminated in some cases by reducing the latency, but this reduces the upper operating frequency as well. Thankfully this problem does not happen if you start your read at the beginning of a page, so ideally any burst read that crosses a page boundary should really begin there.
Nice work. It would be interesting to see how far the transitions shift under temperature
Did you add a capacitor or is this straight parallax hyper accessory board?
I fixed a bug in the HyperFlash testing and can now get it to read successfully with sysclk/2 transfers from 25MHz to 360MHz (didn't want to try any higher).
However for sysclk/1 read timing and HyperFlash it doesn't seem to follow the same profile as the HyperRAM and I get errors in different ranges if I setup the same input delay as the RAM uses. So it's likely to be the case that it requires a different delay profile. This will be ok as the driver does it per bank, but I'll just need to play more to figure out the new ranges...this is what I found with the delays used earlier:
25-87 MHz ok
88-95 MHz Bad
96-119 MHz ok
120-125 MHz Bad
126-179 MHz ok
180-191 MHz Bad
192-224 MHz ok
225-249 MHz Bad
250-269 MHz ok
270-286 MHz Bad
287-360 MHz ok
That hyperFlash has more frequency bands than expected too. Presumably that's registered clock pin, unregistered data pins, correct?
No, this was with the clock unregistered. It was alternating between registered/unregistered data only through the frequency bands.
I had earlier also tried enabling a registered clock with the HyperRAM only (experimental only right now) and I think it worked only at sysclk/2 IIRC. I probably still have some software timing off with sysclk/1 and registered clock operation in the actual driver code and hopefully may just need to change the delay by another clock to compensate, but I'll need to look into that more once I slow it down and hook it back into the logic analyzer again.
If the address phase gets timed wrong all bets are off, so this is important. Writes will be rather risky if the HyperRAM thinks it got a read command instead, because the driver then drives data out from the P2 pins at the same time as the device does.
Update: It's probably best to register these clock outputs and keep that as the default, plus the upper value is increased. I suspect keeping the clock output timing unregistered is possibly more dependent on path delays through the P2 vs when it is latched but I can't be sure. To do this I'll need to change those breakpoint frequencies again and try to center them in the overlapping portions.
I've been thinking about this page crossing problem in the HyperFlash. I think it makes sense to break apart the transfers that cross page boundaries into multiple portions. So if the page size is 16 bytes and you wanted to transfer 43 bytes from address offset 9 in some page, you would transfer 16-9 = 7 bytes first, then the remaining 36 bytes using some multiple of the page size as the burst size. I can certainly do this in the SPIN2 driver layer for burst reads but it would be good to squeeze it into the PASM driver itself and it would work with gfx and general list transfers etc.
Given the way I already fragment the long bursts and can continue them, this may not be too much code and could probably fit the way I do things. I need to think about it...
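The page-boundary split described above can be sketched like this in Python (a sketch of the idea only, not how the PASM driver will necessarily do it). The function name is hypothetical and the 16 byte page size is just the example value from the post.

```python
def split_at_page_boundary(addr, count, page_size=16):
    """Split a read into (address, length) parts so that any part that
    crosses a page boundary begins exactly on one.

    The first part runs up to the next boundary; the rest starts on it.
    """
    to_boundary = (-addr) % page_size   # bytes left in the current page, 0 if aligned
    first = min(count, to_boundary)
    parts = []
    if first:
        parts.append((addr, first))
    if count > first:
        parts.append((addr + first, count - first))
    return parts

if __name__ == "__main__":
    # 43 bytes from offset 9 in a 16 byte page: 7 bytes, then the remaining 36
    print(split_at_page_boundary(9, 43))
```

The second part could then be issued as one burst or broken into page-multiple bursts, whichever fits the existing burst fragmentation.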
Eric is a very productive guy.
Software like this is hard if you leave it alone for too long and need to relearn it each time you come back to it.
In my driver I have just set up some frequency intervals in the CON section, and they can easily be altered if some suitable timing profile is known in advance. The table I am using is this (for sysclk/2 read transfer rates):
The effect of these DELAYx values is to do this (the actual bits of DELAYx get split further):
P2 frequency <90MHz : use registered data pins, delay between WYPIN clock output and streamer XINIT read = 6 clocks
90-120MHz : use unregistered pins, delay = 7 clocks
120-180MHz : use registered data pins, delay = 7 clocks
180-225MHz : use unregistered pins, delay = 8 clocks
225-270MHz : use registered data pins, delay = 8 clocks
P2 frequency >270MHz : use unregistered data pins, delay = 9 clocks
Also if reading at sysclk/1 the above clock delays need to be reduced by 1 because the phase differs by one clock cycle with respect to the streamer.
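The bands above can be modelled as a simple lookup. This is only a host-side Python sketch of the profile as listed, not the driver's CON table or its DELAYx bit encoding. The names `SYSCLK2_READ_BANDS` and `read_timing` are invented for the example, and assigning an exact band-edge frequency (e.g. exactly 90MHz) to the upper band is an assumption, since the list does not say which side the edges belong to.

```python
# (upper_limit_mhz, registered_data_pins, delay_clocks) for sysclk/2 reads,
# copied from the bands listed above.
SYSCLK2_READ_BANDS = [
    (90, True, 6),
    (120, False, 7),
    (180, True, 7),
    (225, False, 8),
    (270, True, 8),
    (float("inf"), False, 9),
]


def read_timing(freq_mhz, sysclk1=False):
    """Return (registered_data_pins, delay_clocks) for a P2 frequency in MHz.

    Assumption: a frequency exactly on a band edge uses the upper band.
    For sysclk/1 reads the delay is reduced by one clock, as noted above.
    """
    for upper, registered, delay in SYSCLK2_READ_BANDS:
        if freq_mhz < upper:
            if sysclk1:
                delay -= 1
            return (registered, delay)
    raise ValueError("frequency out of range")


print(read_timing(200))                 # 180-225MHz band
print(read_timing(297, sysclk1=True))   # >270MHz band, one clock less
```

If a board has a known different profile (capacitor fitted or not, for instance), only the table would need to change.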
For the current writes and the address phase for both reads/writes I always leave the data pins registered to keep the timing constant there. For future sysclk/1 writes I might need to review that choice.
What worries me slightly is that boards without the suggested capacitor vs with the capacitor will have different optimal operational frequency ranges. There is no way the driver can fully know which to apply in advance which means any distributed applications/demos etc using the HyperRAM can perform quite differently in different systems depending on them having that capacitor fitted or not, even at the same temp and P2 clock speed. The addition of the capacitor for supporting future sysclk/1 writes reduces the operational range and interval overlaps slightly too, unfortunately around the 297MHz rate which is used for HDTV timing, which is not ideal. Perhaps with proper 3V rated v2 HyperRAM this may not be such an issue.
So there'll be sysclk/2 read/write reliable on all boards, and sysclk/2 writes reliable with sysclk/1 reads somewhat reliable on most boards. And fully sysclk/1 "mostly" reliable on the special boards. "Mostly" because there are still going to be read frequency bands, but hopefully broad enough to cater for general use without issue.
This information might be important for @"Peter Jakacki", particularly if he plans to fit a capacitor on his upcoming P2PAL HyperRAM. Perhaps it might be sufficient just to have the footprint for one on the PCB, and people could solder it on later if they want to experiment with sysclk/1 writes.
I think it is quite a bit easier to use now as I have simplified some APIs and only do the bus creation internally within the driver once its first device is mapped. You can now use it as easily as one or two lines like this...
Here's the latest API I have now and it shouldn't need to change much now I hope, maybe some minor name tweaking. In each description using it, "r" represents the returned result/error.
There also may be some scope in the future to map other memory types such as SPI flash using a similar API so the software infrastructure could remain common. Eg. there could be a mapSpiFlash(flashStartAddr, size, miso, mosi, cspin, clkpin) API added etc which could map elsewhere into the common 4GB external memory address space. Some extra overhead in the outer Read/Write functions is required for enabling this probably using method pointers, but the software flexibility gains could be rather good allowing data to be sourced from different devices with the same API. TBD..
I think I'll also introduce a non-blocking read/write option in the list. Possibly I could use the MSB of the listPtr as an optional flag that will indicate not to block, and the requesting client can then later poll or wait on its ATN for the result. Even in SPIN this will be useful, especially once longer lists are used, and you'll be able to do work while the data is being transferred in the background.
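To make the flag idea concrete, here is a small Python sketch of packing a "don't block" flag into the MSB of a 32-bit listPtr value. It only illustrates the bit arithmetic; the constant and function names are hypothetical and this is not the driver's actual mailbox or API layout.

```python
NONBLOCK_FLAG = 1 << 31          # hypothetical: MSB of the 32-bit listPtr
ADDR_MASK = NONBLOCK_FLAG - 1    # remaining 31 bits carry the hub address


def encode_list_ptr(hub_addr, nonblocking=False):
    """Combine a hub address with the optional non-blocking flag."""
    if not 0 <= hub_addr <= ADDR_MASK:
        raise ValueError("hub address does not fit below the flag bit")
    return hub_addr | (NONBLOCK_FLAG if nonblocking else 0)


def decode_list_ptr(value):
    """Return (hub_addr, nonblocking) from an encoded listPtr value."""
    return value & ADDR_MASK, bool(value & NONBLOCK_FLAG)


encoded = encode_list_ptr(0x12340, nonblocking=True)
print(hex(encoded), decode_list_ptr(encoded))
```

One consequence of this encoding is that a listPtr of 0 with the flag set is no longer zero, so any "is there a list?" check would have to mask the flag off first.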
Eg. what's wrong with having this:
I wouldn't mind so much adding an extra flags argument just to execList for this, but the dozen other methods would also need it for the case where their listPtr is 0 and they want non-blocking operation enabled.
Reading its NVCR gives $8EBB which is what the data sheet says is the default. This proves both reads/writes are functional because you first need to write a special pattern before reading the register.
If some future spin of the silicon can accommodate the full 1MB of Hub RAM then you'll want that 20th bit. That said, if you are satisfied with limiting listPtr to only be able to point at the lower (existing) half of the Hub that's probably ok (but not pretty).
If the coloured chart you produced still accurately represents the fields within the mailbox then you have 4 don't care bits above the list pointer field. Could you give one of those this purpose?
Note: This Non-blocking thing is something handled in the SPIN layer where it won't wait in the repeat loops above, it is not done in the PASM, though you would want to enable ATN notifications for it to work correctly.
I found we can also write using a single 512 byte write burst operation after sending the special 3 word unlock sequence. This will improve application performance, avoiding many extra mailbox transactions doing it word at a time etc. With video running we should still be able to get 1-4 HyperFlash mailbox transactions done per scan line, so writes can happen fast enough, certainly faster than the 0.5-2ms write time per 512 bytes, which is limited by the device itself.
Any sector erase time is still not ideal though. It's 2.9 seconds per 256kB sector erased. I still can't get my head around that delay. Writing 1MB will likely take ~13 seconds if you also have to erase the sector first but probably just 1 second if already erased, and it's up to 4 minutes to erase the whole 32MB chip!
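The rough numbers above can be checked with a few lines of arithmetic. This Python sketch uses only the figures quoted here (2.9 seconds per 256kB sector erase, 0.5-2ms per 512 byte write burst) and ignores mailbox and transfer overheads; the function name is made up for the example. It does not model the whole-chip erase figure.

```python
SECTOR_BYTES = 256 * 1024        # erase sector size quoted above
SECTOR_ERASE_S = 2.9             # seconds per sector erase quoted above
PAGE_BYTES = 512                 # write burst size quoted above
PAGE_WRITE_S = (0.0005, 0.002)   # best/worst write time per 512 bytes


def flash_write_time(total_bytes, erase_first=True):
    """Return (best_case_s, worst_case_s) to write total_bytes of flash."""
    sectors = -(-total_bytes // SECTOR_BYTES)   # ceiling division
    pages = -(-total_bytes // PAGE_BYTES)
    erase = sectors * SECTOR_ERASE_S if erase_first else 0.0
    return tuple(erase + pages * t for t in PAGE_WRITE_S)


one_mb = 1024 * 1024
print(flash_write_time(one_mb))                     # erase + write
print(flash_write_time(one_mb, erase_first=False))  # already erased
```

For 1MB this gives about 12.6 to 15.7 seconds with the erase and about 1.0 to 4.1 seconds without it, so the ~13 second and ~1 second figures correspond to the faster end of the write time range.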
This is the first time I've been able to try anything like this, and having the proper SPIN2 API integrated makes all the difference for faster experimental testing.
Ok, I understand now.
Is there any point in keeping b31 & b30 free?
You can pass the c & z flags in these. Of course you can test b31 into c on a rdxxxx from hub.
However, in my mailbox scheme I already make extensive use of bit31 to test whether the mailbox request is active (with TJS, etc), and the lower bits including bit30 already contain other data used to indicate the request type (in fact bit30 = read/write).
I've found that when extracting the C/Z pair in one go with RCZL/RCZR you need to rotate twice or copy/restore the original value, as there is no "NR" anymore, so reading them independently as needed is just as fast and avoids that double rotation step.
Actually, within the driver I just use the entire upper 8 bit value of the first mailbox long as a table jump index anyway, which includes the bank bits / memory type, so extracting two flags there is not that important. This method gives me instant branching to where I want it, by both service and memory type (flash or RAM), as well as my control path with special bank 15. It's fast and avoids multiple branches, but the jump table does burn 128 COG LONGs. I can save part of this space in the future by adding 2-3 more instructions per request if I get desperate.
Nothing in particular. I remembered there were instructions to extract the c&z flags. Shame we didn't think about making another instruction that did not rotate.
If I hunt for some more instruction optimizations I'm sure I can shrink it down a little here and there, but I haven't got desperate enough for that yet because it is sort of feature complete now. There's no room left for arbitrary angle pixel plotting at this stage. That potential idea might have to be jettisoned for now, unless that 128 COG RAM long jump table implementation ever gets ditched. Adding this would still speed up non-horizontal and non-vertical line drawing in 8/16/32bpp modes though...so I still like the idea of it.
@evanh. I was able to replicate this type of HyperRAM read test using my driver by iterating through the different P2 frequencies from 25MHz to 310 MHz, although I think 45-50MHz is probably about the practical lower operating limit if the 4uS CS time is to be honoured and you want to be able to transfer more than a long at a time, given the overheads and address phase etc.
The code and module appear to work together across the frequency bands if you set the delay appropriately at the transition points. This is what I used to change the delay as I varied the P2 frequency (freq in MHz). The LSB of the delay actually controls registered/unregistered data pin selection, which adds a little more delay (it's like a half step of the true delay). I tested out the HyperRAM module in pin positions 0-15, 16-31, and 32-47 and all worked. Operating the module at base pins 0 or 16 seemed to top out at around 308MHz and 304MHz for final successful sysclk/2 and sysclk/1 read rates respectively at 20C room temperature. Running at base pin 32 with the P2-EVAL is slightly slower, hitting around a 304-305MHz P2 limit for both rates. It's still thankfully a little over 297MHz, which is a sweet spot for 1080p.
Test output is attached showing the delay value changes and read test results for HyperRAM. I write a different 256 byte pattern at different addresses once at the start at 200MHz with sysclk/2 and then read back each pattern for different frequencies, then compare byte by byte. It's not an intensive memory test, just there to test my own timing delay values which, when incorrect, quickly show up as a skew offset by one or more bytes. Only the first 16 bytes of what was sent and received back are dumped for brevity.
One thing users will need to know is that reading a burst from flash can introduce gaps in the data when it crosses certain page boundaries. The streamer cannot compensate for this because it does not interpret RWDS as the data byte strobe, and the gaps will end up in hub memory. These gaps can be reduced or eliminated in some cases by reducing the latency, but this reduces the upper operating frequency as well. Thankfully this problem does not happen if you start your read from the beginning of the page boundary, so ideally any burst read that crosses the page boundary should really begin there.
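The byte-by-byte comparison and the "skew" symptom can be sketched on the host side like this. It is a simplified Python illustration, not the actual test code: the function name `find_skew`, the sign convention and the `max_skew` limit are all assumptions made for the example.

```python
def find_skew(sent, received, max_skew=4):
    """Return the byte skew between sent and received buffers.

    0 means an exact match, +k means the data arrived k bytes late,
    -k means it arrived k bytes early, and None means no shift of up to
    max_skew bytes explains the difference (i.e. real corruption).
    """
    if len(sent) != len(received):
        raise ValueError("buffers must be the same length")
    n = len(sent)
    # Try the smallest shifts first so an exact match always wins.
    for k in sorted(range(-max_skew, max_skew + 1), key=abs):
        if k >= 0:
            matches = received[k:] == sent[:n - k]
        else:
            matches = received[:n + k] == sent[-k:]
        if matches:
            return k
    return None


pattern = bytes(range(256))          # a 256 byte test pattern
late_by_one = b"\xff" + pattern[:-1]
print(find_skew(pattern, pattern), find_skew(pattern, late_by_one))
```

A pattern without short repeats (such as an incrementing count) is needed for this to work, otherwise several different shifts would compare equal.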
Nice work. It would be interesting to see how far the transitions shift under temperature
Did you add a capacitor, or is this the straight Parallax hyper accessory board?
However, for sysclk/1 read timing the HyperFlash doesn't seem to follow the same profile as the HyperRAM, and I get errors in different ranges if I set up the same input delay as the RAM uses. So it's likely that it requires a different delay profile. This will be ok as the driver does it per bank, but I'll just need to play more to figure out the new ranges...this is what I found with the delays used earlier:
That HyperFlash has more frequency bands than expected too. Presumably that's registered clock pin, unregistered data pins, correct?