Well the final address of the 2 long burst will always be 8 bytes after mailboxPtr1.
Eg. A, B, C are contiguous longs in HUB
A <--- mailboxPtr1
B <--- mailboxPtr2
C
First WRLONG writes longs B&C using mailboxPtr2 as a 2 long burst with SETQ #2-1
Second WRLONG writes A using mailboxPtr1, immediately after the first WRLONG burst.
Oh, right, I'm talking longword addressing too, or more specifically how the accesses occur. If there is longword misalignment then that imposes a +1 sysclock as well.
So, for an 8-cog prop2, and assuming no misalignment, then modulo the hub rotation, address A's hub slot comes around six sysclocks after address C's. The single WRLONG to mailboxPtr1 will take 6 sysclocks to execute.
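For illustration, that hub-slot arithmetic can be modelled in a few lines. This is a rough Python sketch of the 8-cog egg-beater rotation, not a cycle-accurate simulation:

```python
# Rough model of the 8-cog P2 "egg-beater" hub rotation: each hub RAM
# slice is visible to a given cog once per 8 sysclocks, and consecutive
# long addresses live in consecutive slices. Illustrative only.

def hub_slice(long_index):
    """Hub slice serving the given long address (index in longs)."""
    return long_index % 8

def clocks_between(from_long, to_long):
    """Sysclocks from one long's hub slot until the next visit of another's."""
    return (hub_slice(to_long) - hub_slice(from_long)) % 8

# A, B, C are contiguous longs at indices 0, 1, 2.
# The SETQ burst writes B and C; the follow-up WRLONG targets A.
print(clocks_between(2, 0))   # 6 - A's slot comes around 6 sysclocks after C's
```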
Ok good, 6 clocks is closer to 3 than 10 at least. Thanks evanh. This is the penalty that will (only) happen for PASM HyperRAM driver clients that use the fifo when issuing requests. In fact it's actually just 5 more, because the optimal approach needs an extra clock for its 3-long mailbox write anyway.
Definitely will be leaving the mailbox order as is now; the read side penalty after that proposed mailbox order change would be far worse than these 5 extra clocks during request setup, given single reads already take 9-16 clocks.
With all our recent experience a couple of things stand out now:
- The number of valid frequency bands in the ping testing is only two for registered and three for unregistered pins.
- And, related, the bottom good band reaches up to sysclock of 150 MHz for unregistered, and up to 290 MHz for registered pins!
Admittedly, this is not with burst data transfers but, as you can see, there is a huge difference between the transition bands of those two. My take is that with the faster v2 HyperRAMs we'll see much better outcomes.
PS: Testing is with my glob-top revB chip.
EDIT: Got it wrong for unregistered pins. It is three valid frequency bands. So, the main change is the frequencies are all scaled up heaps. Still the same conclusion really - faster HyperRAM and short tracks on the board will have a big impact.
I also cut a couple of pins off the chip to see how much faster they go without the loading of the PCB:
- Registered went roughly from 290 MHz to 360 MHz.
- Unregistered went roughly from 150 MHz to 185 MHz.
Note: That's the bottom band in both cases.
EDIT: The non-globtop revB is about the same results.
Here's the source code. It works on revA too, since that was its original target. When running it, you have to pipe/capture the output to a file. It outputs almost 480 kB.
PS: I've got the baud set to 230400. Easy to adjust by setting "asyn_baud" on line 34 of the source.
Tubular recently indicated to me that v2 HyperRAM is now available in volume, so with any luck we'll be able to see how much better v2 really works on the P2 once we get a board with that fitted and try it.
I've been working a little on the HyperFlash program/erase APIs today. The Flash device is complex and the initial driver API certainly won't cover all its features and capabilities. However, low level access to internal flash registers is still possible in cases where further control is required, e.g. sector locks, secure silicon region, password protection, etc. I am making it so that flash erase and program can optionally time out after some given time, or alternatively allow other non-flash operations to continue while the device is busy (set the wait timeout to 0), with the client polling the flash status itself.
Given the flash device gets blocked during erase/programming, the client(s) will have to use their own locks to ensure that no other COGs are accessing flash during erase operations etc. This driver won't manage all that, at least at this stage, though a layer above this driver could potentially assist. I'd need to think up a way to only allow one COG access to flash otherwise via the mailbox.
Here's the simplified API (which may still change):
' read status word
PUB readFlashStatus(addr) : r
' erase 256kB sector
PUB eraseFlashSector256k(addr, timeoutMs) : r
' erase entire Flash device
PUB eraseFlashDevice(addr, timeoutMs) : r
' program Flash data from HUB RAM buffer, must be word aligned in flash
PUB programFlash(addr, srcHubAddr, byteCount, timeoutMs) : r
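To make the two timeout modes above concrete, here is a sketch in Python rather than Spin2. The BUSY bit value and the `hyper` driver object are hypothetical stand-ins, not the real driver's names:

```python
# Two usage patterns for the erase/program timeout described above.
# BUSY and the "hyper" object are invented placeholders for illustration.

BUSY = 0x80   # hypothetical "device busy" bit in the flash status word

def erase_blocking(hyper, addr, timeout_ms):
    # Pattern 1: let the driver wait internally, up to timeout_ms.
    return hyper.eraseFlashSector256k(addr, timeout_ms)

def erase_and_poll(hyper, addr, do_other_work):
    # Pattern 2: a timeout of 0 returns immediately; the client then
    # interleaves non-flash work while polling the flash status itself.
    hyper.eraseFlashSector256k(addr, 0)
    while hyper.readFlashStatus(addr) & BUSY:
        do_other_work()
```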
I see ISSI specs 133 MHz at 3.3V on OctaRAM, and Winbond has 166 MHz at 3.3V for 64Mb parts and is planning 200 MHz at 3.3V for 128Mb.
I had been hoping to figure out a simple way of preventing simultaneous HyperFlash reads from other COGs from messing up a COG preparing the special command sequences for HyperFlash functions like erase, unlock, programming, or accessing HyperFlash registers etc.
I think this capability will probably have to wait until a future driver version frees up additional space for supporting locks. Adding it would also introduce some extra checking overhead on every read: probably two extra instructions, to test for the locked flash bank case and then conditionally call out somewhere to handle that case by checking whether the calling client COG holds the lock or not. The same goes for the flash word/burst/register writes. Normal HyperRAM memory writes thankfully won't be affected, as these use a separate path to deal with latency cases.
Until this or some similar protection scheme gets added, if you are erasing/writing to flash, or otherwise accessing its special registers at configuration time, you'll need to co-ordinate this with any other COGs that may be reading/writing the flash at the same time. It's not ideal, but it is still probably workable. We can probably take locks inside the SPIN API for programming/reading registers, but other PASM COGs or video COGs can still read the flash outside of this API via the mailbox. That's the issue.
Even if locks are added, managing this gets complicated fast.
How are programs going to like making requests to read the flash, with the possibility of the read failing at any time, either silently or, more preferably, with an explicit error code saying "FLASH IN USE" for example, because another COG happens to start erasing a block or accessing a flash register?
Ideally a given COG's flash request, made while a different COG holds the lock, could stall until the lock gets released, then continue in the driver to return data; this would need to include times during fragmented bursts. I could actually do this by returning to the polling loop without completing the request if I detect I need the lock but it is already taken. However, I can see a problem if a high priority COG requesting a flash read gets stalled, as it would starve out the polling of lower COGs which might hold the lock. Deadlock occurs if the low priority COG has the lock and can't continue during continuous polling of the incomplete high priority COG's request. This would also hammer the polling loop and waste P2 cycles. It would be difficult to elevate the priority of a low priority COG on the fly to remedy this, as the polling priority is essentially encoded into the polling sequence itself at COG (re-)config time. I'd have to remove the blocked COG from the polling list and add it back when the lock is released. That's a fair bit of extra code to deal with all that.
I think I've come up with a lightweight (poor man's?) protection scheme for HyperFlash accesses. Here's how it could work:
When a COG wants to modify the flash, i.e. erase, program, or access one of its registers needing an uninterruptible write sequence prior to the last access, it would first need to set a special protection flag in the per-bank parameters for the flash bank being accessed, along with the ID of the only COG allowed to access the flash. The SPIN2 API would handle this for the client COG. Internally the API may also use a lock when modifying this flash protection state, in case there are multiple writers all vying for access to special flash registers or trying to erase sectors etc, though I expect that should be an unlikely case and as such may not even be supported.
Any read/write requests that access a flash bank will have the protection flag checked by the HyperRAM driver. If this flag is clear the request can continue to proceed, but if the flag is set the code will call out for a further check.
This call will then try to match the allowed COG ID against the requesting COG whose mailbox is being serviced. If they match, the code returns to the calling read/write routine and proceeds to completion. If the COG ID does not match, the access is prevented and the type of serviced COG is then checked. If it is a strict priority serviced COG, the request returns with a failure code (eg. FLASH_BUSY), and the strict priority client will need to be written to deal with that eventuality if it could ever be expected to occur in the system. If the requesting COG is round-robin serviced, its request will stall until a different client releases the flash protection for the bank being accessed. Round robin COGs waiting for flash protection to be released will still consume some clocks retrying during this flash protected state, but the good thing is that each stalled request attempt returns to polling fairly early in the sequence, and this will not stop the processing of all other COGs because the round robin polling order continues to advance on each polling iteration. They will all receive at least 1/n request opportunities, where n is the number of enabled RR COGs in the polling loop. Starvation or deadlock should not happen.
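The no-starvation claim can be sanity-checked with a toy model. This is Python with invented COG IDs; the real polling loop is PASM, so this is purely illustrative:

```python
# Toy model of the round-robin behaviour described above: a COG whose
# request targets a protected bank returns to polling early and retries
# later, while the RR order keeps advancing, so the lock holder still
# gets polled and can finish its work.

def poll(rr_cogs, lock_holder, polls_to_finish):
    order = list(rr_cogs)
    serviced = []
    locked = True
    work_left = polls_to_finish        # polls the lock holder needs
    for _ in range(100):               # polling iterations
        cog = order[0]
        order = order[1:] + [cog]      # RR order advances every pass
        if not locked or cog == lock_holder:
            serviced.append(cog)
            if cog == lock_holder:
                work_left -= 1
                if work_left == 0:
                    locked = False     # lock released
        # else: stalled request returns to polling early, retried later
        if not locked and set(serviced) == set(rr_cogs):
            return serviced
    return serviced

done = poll([1, 2, 3, 4], lock_holder=3, polls_to_finish=2)
assert set(done) == {1, 2, 3, 4}       # nobody starves or deadlocks
```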
The expected timing penalties in clock cycles per request or per burst fragment if this type of flash protection is added to the current driver code are shown in the table below:
I think this is a reasonably simple way for the driver to protect the flash while one COG is programming it and other COGs can still try to access it. While protected flash banks get somewhat affected, it would still allow accesses to other flash banks and HyperRAM without introducing much overhead. The register access overhead is not particularly significant given it's done very rarely and, when a bank is protected, only one COG is going to be accessing it for erasure or a program burst, which typically takes far longer to process than the penalty involved anyway.
I can probably fit the first part of this extra protection stuff into my main LUTRAM code paths after some recent shuffling. However it would unfortunately still take another ~13 COG RAM longs for the extra COG test code, which I don't yet have room for unless I move some of its existing code into HUB exec, or make other space-freeing optimizations which add four more clock cycles to each request (that's a huge change, more suited to a later release). If I just put this new flash protection code into HUB exec it will slow the protected cases down proportionally, for all cases > 4 clocks, but maybe even that is still okay.
UPDATE: Some sample PASM for this protection is now coded and it appears to fit! Still untested as yet. I was able to move my rarely called COG reconfiguration ATN handler into HUB exec, which freed some more longs, and I scrounged the others with more register sharing. This particular moved code only runs if you wish to dynamically add/remove COGs from the polling list after driver initialisation time. I'd prefer not to use very much HUB exec, because HUB memory corruption from other COGs can then start to affect things in the memory driver; however this particular code is not something that gets called for normal requests, so it should be pretty safe to move to HUB exec. As with my video driver, it's nice to keep these drivers operating as much as you can under HUB memory corruption situations, to help you debug things for as long as possible. Obviously if the memory driver's mailbox area or video config regions get corrupted that could mess things up, but at least that is a smaller target in memory and is not actually executable code.
@evanh, do you think we can use a simple interpolation of the needed input delay for HyperRAM and HyperFlash based on temperature as well as frequency?
I already have some control methods in this driver's API that adjust the delay by operating frequency. At startup each device on the bus can be assigned its own delay profile, which defines the frequency breakpoints at which the delay changes. I can also modify this profile and look it up later when instructed by the API, to decide what input delay to use for all following read operations to that device if things change. I am now wondering if I should include an optional temperature parameter which could somehow be used to interpolate delay information. If the temperature is not known, or is just passed in as 0, I guess it could fall back to the existing room temp default.
Right now these profiles are fairly simple and the related driver code is shown below (final default delays for HyperFlash are still TBD). One way to go is to have the existing setDelayProfile call just apply a whole new profile that is more specific to a particular temperature, and the driver can then be instructed to use new delay values from it via my setDelayFrequency method if the frequency changes. It would be nice to come up with something automatic though, which could somehow construct a new profile on the fly based on temperature. Maybe that is something for later...?
' associate a custom delay profile for a device, no change to actual driver input delay
PUB setDelayProfile(addr, profile) : r | bus, bank
    bus := addrMap[addr >> 24]
    bank := (addr >> 24) & $f
    if bus +> MAX_INSTANCES - 1
        return ERR_INVALID
    profiles[bus * NUMBANKS + bank] := profile
    return 0
' if the frequency changes at runtime this API can be used to adjust the input delay timing for a device
' TODO: tempK is in this API for the future (if temperature compensation can be applied, 0 to ignore)
PUB setDelayFrequency(addr, freq, tempK) : r | bus, bank
    bus := addrMap[addr >> 24]
    bank := (addr >> 24) & $f
    if bus +> MAX_INSTANCES - 1
        return ERR_INVALID
    return setDelay(addr, lookupInputDelay(freq, profiles[bus * NUMBANKS + bank]))
' looks up input delay to use at a particular frequency from a profile
PRI lookupInputDelay(freq, profile) : delay
    delay := long[profile][0]
    repeat while long[profile][1]
        if freq +< long[profile][1]
            quit
        profile += 4
        delay++
' sets the delay value used in the driver for the memory device mapped to the address
' addr  - address of the Hyper device to configure
' delay - nibble value passed is (delayClocks * 2) + (registeredDataBusFlag)
' returns 0 for success or negative error code
PUB setDelay(addr, delay) : r
    if delay +> 15
        return ERR_INVALID
    r := modifyBankParams(addr, $FFFF0FFF, delay << 12)
'Default delay profiles used for HyperFlash and HyperRAM on P2-EVAL HyperRAM breakout board
'operating at room temp. This can be tweaked or others added for different temperatures.
'These delay profiles can be assigned to each configured device at address mapping time.
'The actual operating input delay can also be adjusted on the fly per bank if the variation
'of delay with temperature is already determined and the temperature is known/measurable.
HyperRamDelays long 6,88_000000,120_000000,180_000000,225_000000,270_000000,0
HyperFlashDelays long 5,88_000000,120_000000,180_000000,225_000000,270_000000,0
'The profile format begins with the initial delay value, followed by the frequencies at
'which the delay is sequentially increased. The lookup stops once the frequency falls
'below the next breakpoint, or when the list terminates with a zero. Frequencies must
'be stored in increasing order.
' e.g. using the data above
' if          0 <= freq <  88_000000 Hz, the delay compensation value is 6,
' if  88_000000 <= freq < 120_000000 Hz, the delay compensation value is 7,
' if 120_000000 <= freq < 180_000000 Hz, the delay compensation value is 8,
' ...etc...
' if 270_000000 <= freq, the delay compensation value is 11
'
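One way the optional tempK parameter could construct a profile on the fly is to hold one measured profile per reference temperature and linearly interpolate the breakpoint frequencies. The sketch below is Python rather than Spin2; the reference temperatures and the "hot" profile's numbers are invented for illustration, and only the room-temp row matches the data above:

```python
# lookup_delay is a Python equivalent of lookupInputDelay above:
# profile = [initial_delay, breakpoint1, ..., breakpointN, 0].

def lookup_delay(freq, profile):
    delay, breakpoints = profile[0], profile[1:]
    for bp in breakpoints:
        if bp == 0 or freq < bp:   # zero terminator, or below breakpoint
            break
        delay += 1
    return delay

def interp_profile(t_kelvin, t_lo, prof_lo, t_hi, prof_hi):
    """Linearly interpolate breakpoint frequencies between two profiles
    measured at temperatures t_lo and t_hi (same initial delay/length)."""
    frac = (t_kelvin - t_lo) / (t_hi - t_lo)
    return [prof_lo[0]] + [
        int(lo + frac * (hi - lo))
        for lo, hi in zip(prof_lo[1:], prof_hi[1:])
    ]

room = [6, 88_000_000, 120_000_000, 180_000_000, 225_000_000, 270_000_000, 0]
hot  = [6, 80_000_000, 110_000_000, 165_000_000, 210_000_000, 255_000_000, 0]  # invented
print(lookup_delay(100_000_000, room))   # 7 - past the first breakpoint
```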
Yes, although the narrowness of the higher bands makes me wary of using the accessory board in this way. I see this approach used more for a prop2 board with dedicated HR tucked in close.
Yeah I expect it's probably not guaranteed to work in all cases. I think the first driver release might just aim for room temp defaults only as I have above, yet still provide an ability for people to adjust this profile themselves in advanced cases, where the breakpoints would need to be determined by them for their own boards/temperatures.
Maybe a separate simple scanning tool could be developed to spit out the required delay profile. We can't really determine it at runtime in the final application because we need to do a PLL frequency scan to find breakpoints. Without the scan, we may find two delay values that work at the given P2 operating frequency but we'd not know which one of these could be marginal and which one is far better to use unless we scan below and above to find where it fails next.
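A sketch of what such a scanning tool might emit, assuming it has already swept the PLL frequency for each candidate delay and recorded the working band. The function name and the sweep numbers are invented; the idea is to place each breakpoint mid-overlap so the chosen delay sits away from both failure edges:

```python
# Build a delay profile ([delay0, breakpoint1, ..., 0], the driver's
# format above) from per-delay working frequency bands found by a scan.

def build_profile(bands):
    """bands: {delay: (lowest_ok_freq, highest_ok_freq)} from the sweep."""
    delays = sorted(bands)
    profile = [delays[0]]
    for d_lo, d_hi in zip(delays, delays[1:]):
        lo_top = bands[d_lo][1]        # where the lower delay stops working
        hi_bot = bands[d_hi][0]        # where the next delay starts working
        profile.append((lo_top + hi_bot) // 2)   # switch in mid-overlap
    return profile + [0]

# Invented sweep results for two adjacent delay values:
print(build_profile({6: (10_000_000, 95_000_000),
                     7: (80_000_000, 130_000_000)}))   # [6, 87500000, 0]
```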
Just tested out my COG access protection scheme for HyperFlash, so far it seems to be working nicely.
When a COG locks the flash bank for its own use, e.g. to prepare to erase a sector, any RR-COG accessing this flash bank can be stalled until the lock is released. The real-time strict priority COGs instead fail with an error code indicating the flash is busy, as they can't be held up. I will also add the optional choice of having any RR COG also being able to fail immediately with this same error code if desired.
It does require some co-operation, and it could obviously be bypassed by other PASM clients ignoring this convention and directly overwriting the lock for themselves by going through their mailboxes to issue raw register commands, but it can now at least protect the flash from being read while it shouldn't be (eg. during erase/program, or in the midst of a register setup sequence, which could corrupt that transaction). In most cases only one COG should ever really need to modify the flash at a time, so this won't be an issue, though you could have several other readers sharing it. I'm happy enough for now with this capability versus what I had before.
Here's the current API for the driver. Shouldn't need to change too much from here with any luck. I really don't want any more features as it gets progressively harder to fit them in. Just any final bug fixes I find from here.
This list looks fairly extensive now, but if you go with defaults you only need to call the init function once, and then you can use it to read or write data from RAM or Flash at your mapped address right away; it's designed to be pretty easy that way. Also, the size of this object in your image is determined by the number of APIs you actually call, at least with how Fastspin builds things. I hope official SPIN2 also has uncalled code removal (can't recall).
Using Fastspin I built a minimal application with the simple init API method and the single reads/writes/bursts, and it was just under 11kB in size. Some of this is Fastspin's own minimal application overhead of 1312 bytes, plus around 4000 bytes of the PASM HyperRAM driver. If you include every single API with the HyperFlash & list stuff and all the custom tweaking APIs etc, it grows to just under 24kB, again including the same PASM and Fastspin overheads. It will be interesting to compare this amount against Chip's byte coded SPIN2. I'll need to try that again when I can.
'simplified driver startup with defaults applicable to the P2-EVAL HyperRAM/HyperFlash module
PUB initHyperDriver(basePin, ramAddr, flashAddr, flags, freq) : bus
OR
PUB initHyperDriverCog(basePin, ramAddr, flashAddr, flags, freq, cog) : bus
'driver startup for other custom setups
PUB mapHyperRam(addr, size, datapin, cspin, clkpin, rwdspin, resetpin, burst, delayProfile) : bus
PUB mapHyperFlash(addr, size, datapin, cspin, clkpin, rwdspin, resetpin, burst, delayProfile) : bus
PUB start(bus, flags, freq) : driverCog
OR
PUB startCog(bus, flags, freq, cog) : driverCog
'memory/reg reads
PUB readByte(addr) : r
PUB readWord(addr) : r
PUB readLong(addr) : r
PUB read(dstHubAddr, srcAddr, count) : r
PUB readReg(addr, regaddr) : r
PUB readRaw(addr, addrhi_16, addrlo_32) : r
'memory/reg writes
PUB writeByte(addr, data) : r
PUB writeWord(addr, data) : r
PUB writeLong(addr, data) : r
PUB write(srcHubAddr, dstAddr, count) : r
PUB writeReg(addr, regaddr, value) : r
PUB writeRaw(addr, addrhi_16, addrlo_32, value) : r
'memory read-modify-writes
PUB readModifyByte(addr, data, mask) : r
PUB readModifyWord(addr, data, mask) : r
PUB readModifyLong(addr, data, mask) : r
'fills/copies/list oriented transfers
PUB readList(dstHubAddr, srcAddr, count, listPtr) : r
PUB writeList(srcHubAddr, dstAddr, count, listPtr) : r
PUB fillBytes(addr, pattern, count, listPtr) : r
PUB fillWords(addr, pattern, count, listPtr) : r
PUB fillLongs(addr, pattern, count, listPtr) : r
PUB copyBuf(dstAddr, srcAddr, totalBytes, hubBuffer, bufSize, listPtr) : r
PUB execList(bus, listptr) : r
'graphics specific
PUB gfxCopyImage(dstAddr, dstPitch, srcAddr, srcPitch, byteWidth, height, hubBuf, listPtr) : r
PUB gfxReadImage(dstHubAddr, dstPitch, srcAddr, srcPitch, byteWidth, height, listPtr) : r
PUB gfxWriteImage(srcHubAddr, srcPitch, dstAddr, dstPitch, byteWidth, height, listPtr) : r
PUB gfxFillBytes(dstAddr, dstPitch, width, height, pattern, listPtr) : r
PUB gfxFillWords(dstAddr, dstPitch, width, height, pattern, listPtr) : r
PUB gfxFillLongs(dstAddr, dstPitch, width, height, pattern, listPtr) : r
'HyperFlash specific
PUB eraseFlash(addr, flags) : r
PUB pollEraseStatus(addr) : r
PUB programFlash(addr, srcHubAddr, byteCount) : r
PUB programFlashByte(addr, data) : r
PUB programFlashWord(addr, data) : r
PUB programFlashLong(addr, data) : r
PUB readFlashStatus(addr) : r
PUB clearFlashStatus(addr) : r
PUB readFlashInfo(addr, wordoffset) : r
PUB readFlashICR(addr) : r
PUB readFlashISR(addr) : r
PUB readFlashVCR(addr) : r
PUB readFlashNVCR(addr) : r
PUB writeFlashICR(addr, data) : r
PUB writeFlashISR(addr, data) : r
PUB writeFlashVCR(addr, data) : r
PUB setFlashLatency(addr, latency) : r
'HyperRAM specific
PUB readRamIR(addr, ir_num, mcpdie_num) : r
PUB readRamCR(addr, cr_num, mcpdie_num) : r
PUB writeRamCR(addr, cr_num, mcpdie_num, value) : r
PUB setRamLatency(addr, latency) : r
'misc driver config APIs
PUB setDriverLatency(addr, latency) : r
PUB getDriverLatency(addr) : r
PUB setBurst(addr, burst) : r
PUB getBurst(addr) : burst
PUB setDelay(addr, delay) : r
PUB getDelay(addr) : delay
PUB setDelayProfile(addr, profile) : r
PUB setDelayFrequency(addr, freq, tempK) : r
PUB lockFlashAccess(addr) : r
PUB unlockFlashAccess(addr) : r
PUB getFlashLockedCog(addr) : r
PUB getMaxBurst(frequency, cs_interval, latency) : clocks
PUB getMailboxAddr(bus, cog) : addr
PUB getDriverCogID(bus) : cog
PUB setupCogParams(cogmask, bus, burst, prioFlags) : cog
PUB removeCogs(cogmask, bus) : r
PUB getLastError(bus) : r
PUB shutdown(bus) : i
Just found out that under PNut this same HyperRAM driver takes up 12600 bytes with all features present and called. This includes 3700 bytes of actual PASM code (I was a little out above with my 4000 byte PASM estimate as I do save some extra longs by using "res"). The shared SPIN2 interpreter will use another 4k on top of this of course.
Unfortunately without calling all the extra features in the code like lists, fills, flash API, etc, and just calling the main init and the basic read/write APIs, it only seemed to save me 300 bytes or so which was just from removing the calls themselves, so it doesn't look like dead code methods get eliminated in official SPIN2 (at least in v34s).
So the penalty for using Fastspin is about 6800 bytes when everything is called, and this improves as less of the driver is called. For the minimal driver builds it appears that PNut will actually consume a bit more memory overhead, which is a shame given it is interpreted and its code size should be smaller... ideally, including dead code elimination would be best.
Hitting around 1.0-1.1MB/s of HyperFlash write speed on a 200MHz P2 when writing single 512kB and 1MB blocks of HUB data to flash (sysclk/2 write speed). When writing it as a full 32MB block I hit more like 2.5MB/s.
HyperFlash erase time seems to be somewhat dependent on contents, but a ballpark data point collected was 674 ms for erasing a 256kB sector of random pre-programmed data. This is ~380kB/s. If you attempt to erase an already erased sector it seems to know this and returns early, in just 12 ms.
Update: the chip erase is also dependent on the contents. I was just able to erase a full chip (which was mostly already erased) in only 3s. So I probably need to first fill it completely (maybe with zeroes) and then retest this.
Update2: full chip erase after 32MB written took 47s to complete. So not as bad as the worst case. This was random data (not all zeroes).
Update3: turns out I had a bug with flash at the 16MB crossing point that affected the results, which I have just fixed. Now a full chip erase of 32MB of an $AA pattern takes ~104s, and with 32MB of zeroes programmed it erased in 91 seconds. The full 32MB of programming took ~31 seconds, so it sustains around 1MB/s writing speed. These numbers are more in line with the datasheet values.
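The quick arithmetic behind the throughput figures quoted above, for anyone checking the numbers:

```python
# Sector erase rate: 256kB erased in 674 ms.
sector_bytes = 256 * 1024
erase_secs = 0.674
print(round(sector_bytes / erase_secs / 1024))        # 380 (kB/s), the "~380kB/s" figure

# Sustained program rate: 32MB programmed in ~31 s.
full_bytes = 32 * 1024 * 1024
print(round(full_bytes / 31 / (1024 * 1024), 2))      # 1.03 (MB/s), "around 1MB/s"
```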
Even though not much has been posted here lately, things have progressed further and I have hopefully fixed the last bug. It took me a while to track down and understand what was happening, as I needed to use the logic analyzer again and revisit the original time critical transfer code.
This last problem happened in a special case with the clock pin configured as registered instead of unregistered: when filling bytes, or writing bursts of odd byte lengths > 2 starting from even HyperRAM addresses, or writing even sized burst lengths > 2 from odd addresses, it was corrupting the last word written. This never happened with the unregistered clock pin setting, which was the original default I used in most of my early testing (but is now changed, because evanh found registered clocks allow the HyperRAM to overclock higher). It didn't happen with words or longs either.
Basically, due to the extra timing lag on the clock pin output in these special cases, the chip select would get raised just before the final clock transition was output, thus not completing the transfer fully and corrupting the last byte. To fix it I was thankfully able to reorder the code and delay the raising of chip select by a couple of instructions, which meant the fix took up no space. Pretty important, as I have no space left anymore, and I was getting worried about how it could be fixed until I knew what it was.
Since the fix, I've been running this with my video driver and even have a basic lightweight GUI type of test running now in 1080p 8bpp using a HyperRAM framebuffer. It's actually surprisingly snappy and I now know that this should be a very usable type of application of the HyperRAM.
Still to do:
-Document more
-Integrate better with Pnut SPIN2 for both video & HyperRAM drivers
-Develop some more examples of how to put this to good use
Damn straight! Prop2 paired to hyperRAM is the cat's pyjamas. I so want to make a dedicated layout just for this but just haven't got my PCB layout sorted. I keep getting bogged down before I get going. I'd do it as a 3-sided Eval Board that stopped at P47.
This is why I think Peter's P2D2 with that HyperRAM supporting P2PAL board already fitted should become so convenient when it's ultimately available. No second HyperFlash chip like the P2-EVAL board has, but P2PAL could still be populated with the combo FLASH+RAM part as a more expensive option if he's wired in that second CS# signal to pin 43 (he said he would).
I have started modifying my HyperRAM driver to support a better control channel. This change frees up bank 15 that was used before, allowing the full 28 bit address range to be mapped, and I won't need to deal with as many special cases in the code. With this you could now have two 128MB devices on a bus, for example, with all addresses accessible, not just 240MB of the 256MB. I didn't really like the memory gap that the control channel required; it was very wasteful and an inconvenience to me.
The original technique I used for supporting the control path was to examine the upper byte in the first mailbox long to decode the request:
Only banks 0-14 could be mapped to a device because bank 15 was the "control" bank.
The REQUEST mapping for regular memory banks was (and still will be) this:
However for the control bank 15, the REQUEST bits got mapped differently:
000 - get bank latency
001 - get device register
010 - get burst parameter data
011 - start new request list
100 - set bank latency
101 - set device register
110 - set burst parameters
111 - set COG parameters and re-configure all COG priorities (I had also used a COGATN for this).
As mentioned, I am going to change this now: instead of using bank 15, the driver will use the previously unused mailbox entries for the driver's own COG ID to indicate and activate control operations, and for control alone. These control operations typically only need to be done once at startup to initialise, but can still be tweaked later for dynamic changes, to experiment with higher performance, to change COG service priorities, and also to lock the flash during writes. Because the driver's mailbox is a common resource that might be shared by multiple COGs after startup, I will probably also protect it with a lock in the SPIN2 API. I am going to be freeing up COGATN for future use.
The only thing that would need to be special now is the ability to start a request list. I will still need that operation to be done in-band on a per COG mailbox basis (so not through the driver's own mailbox), and the way it will be done is to just start a write burst at external memory address $FFFFFF in bank 15. This is the same as having the first mailbox entry written as -1. The second (previously unused) mailbox long entry then becomes the start of the list in HUB RAM.
The particular case of a genuine write burst starting at this top address is unlikely to ever be needed. Burst writes can still start anywhere in the bank 15 device except its highest address; you can still start a write burst somewhere lower that includes this maximum address in the range covered, you just cannot begin on that top address, or you will start a new list. All regular (non-bursting) writes to the top address will work fine too. I doubt this minor restriction is going to be a burden to any calling clients.
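The sentinel test can be sketched as follows. This is Python, and the field layout beyond what's described above (the bank nibble position, the low 24-bit address) is an assumption for illustration, not the driver's actual bit layout:

```python
# Distinguishing the list-start sentinel from a normal request: a first
# mailbox long of -1 (a write burst starting at address $FFFFFF of bank
# 15) means "start a request list", with the second mailbox long holding
# the list's start address in HUB RAM. Field positions are illustrative.

LIST_SENTINEL = 0xFFFF_FFFF   # -1 as an unsigned 32-bit mailbox long

def decode_request(mailbox):
    first, second, _third = mailbox
    if first == LIST_SENTINEL:
        return ('start_list', second)          # second long = HUB list ptr
    bank = (first >> 24) & 0xF                 # bank assumed in top byte
    return ('memory_request', bank, first & 0x00FF_FFFF)

print(decode_request([0xFFFF_FFFF, 0x4000, 0]))   # ('start_list', 16384)
```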
The new REQUEST control mapping will look like this. I will also have the bank bits in these requests actually refer to the bank being modified, not set to %1111 as before.
000 - get bank latency
001 - get device register
010 - get burst parameter data
011 - reserved for debug (eg. read COG RAM state)
100 - set bank latency
101 - set device register
110 - set burst parameters
111 - set COG parameters and re-configure all COG priorities
I'm working on the PASM2 and associated SPIN2 driver changes now. From what I have seen it should fit okay and it might even leave some longs free.
This change is also more consistent with a future single COG speed-optimized variant of my driver I have in mind too... that's another motivation here. I think this will work out well.
Update: Corresponding PASM2 & SPIN2 changes are done. Just need to re-validate and fix anything I broke or missed with this.
@Tubular
Basically yes. I've sort of agonized a bit over this, but once released I don't really want something to change later that alters the underlying mailbox formats and would then affect calling client software, so this is why I should probably change it now, if it is going to change at all. Same goes for my video driver. I'm just readying these drivers before release so they hopefully won't need to change further in ways that might significantly affect their callers. It's still going to be a beta for the HyperRAM driver, but with any luck it should be a pretty solid one with no known bugs.
In the past I've talked about the idea of having different HyperRAM driver variants. If that idea pans out I'd like to keep the existing three long mailbox request structure the same across these as much as possible so I can reuse my existing code where I can.
Ideally we could have all three of the variants below available in the end, though only the first driver version will be released initially. I'd expect the third one to be highly useful too, so it should probably be the next one developed. The second one is mostly just a cut-down version of the first driver with a little less latency overhead, so it may not be compelling enough to work on right away. Or there might be a way to combine the second and third if the separated flash & RAM code paths can all fit in the space available and there is only ever one of each of these device types on the bus, as the breakout module has. I'm going to think about that idea later; it might help to reduce the work further, which would be good.
* Multiple Cog clients, multiple banks, multi-instance (current driver)
- fully featured, supports Flash and RAM device
- should be ideal for video/audio/general use
- includes some graphics acceleration capabilities
- 24 mailbox longs to support 8 COGs, with control channel sent over the driver's mailbox
* Single Cog client, multiple banks, multi-instance
- can still support having flash & RAM devices on same bus such as P2-EVAL HyperRAM module
- no contention with other COGs, so lower latency
- good for single COG non-video uses to extend memory
- 3 long single mailbox only
- control could share same mailbox with top bit indicating control/data (TBD)
- may use COGATN to signal new request instead of polling, driver uses WAITATN
* Single Cog client, single bank, multi-instance
- fastest variant, good for VMs / emulators etc
- some features probably stripped out for speed, eg. maybe no tests for lists/gfx fills etc
- more static parameters in the code instead of being dynamically looked up, e.g. pins, burst size, latency etc
- optionally directly coupled to client via shared LUTRAM path for data transfer speed
- may use COGATN to signal new request instead of polling, driver uses WAITATN or WAITSE#
If ATN & shared LUT is used, the request/response can be rapidly transferred between COGs with minimal latency so that is a good model to have for speed.
Also I wonder, does anyone know when LUT RAM is shared between COGs can you still execute code from it at full speed or do you lose anything there? The documentation indicates that the DDS/LUT streaming modes are impacted, but that should be okay as HyperRAM does not use that. Whose LUTRAM is actually being accessed when it is shared? Does the code setup in one COG's LUTRAM vanish when pairing is first enabled for example, or is it such that writes from one get sent through to the other to the same LUTRAM address? This would be great if the latter was true so only a shared mailbox area is written to both LUTRAMs and code space is not lost from either COG side. I've not played with LUT sharing so I don't really know too much about it yet.
Evan tested LUT sharing and detected a glitch that Chip has since fixed. Apart from the streamer, I think your best-case scenario applies. SETLUTS #1 allows writes from the other cog, which could be done on one or both cogs.
If delta is below the instruction's minimum then add totalCogs to delta; otherwise delta is the instruction's execution time.
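That rule can be sketched as a small model (Python, illustrative only; `MIN_WRLONG` and the slice arithmetic are my assumptions based on the 3..10 sysclock WRLONG window on an 8-cog P2):

```python
TOTAL_COGS = 8      # hub slices on an 8-cog P2
MIN_WRLONG = 3      # assumed minimum WRLONG execution time in sysclocks

def wrlong_clocks(delta: int) -> int:
    """Apply the rule above: delta is the slice offset (0..totalCogs-1)
    between the current hub window and the target address's slice.
    If delta is below the instruction's minimum, add totalCogs."""
    delta %= TOTAL_COGS
    if delta < MIN_WRLONG:
        delta += TOTAL_COGS
    return delta
```

With these assumptions every offset lands in the familiar 3..10 sysclock window, and an address six slices ahead of the current window gives the 6-clock case discussed earlier.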
NOTE: "Adr" means longword-based addressing, i.e. byte addresses divided by four.
I've just been retesting with an old program of mine called pin-ping.spin2, fixing it up at the same time. Back then, JMG reformatted my results into what he called a waterfall - https://forums.parallax.com/discussion/comment/1467859/#Comment_1467859
- Registered went roughly from 290 MHz to 360 MHz.
- Unregistered went roughly from 150 MHz to 185 MHz.
Note: That's the bottom band in both cases.
EDIT: The non-globtop revB is about the same results.
Here's the source code. It works on revA too, since that was its original target. When running it, you have to pipe/capture the output to a file. It outputs almost 480 kB.
PS: I've got the baud set to 230400. Easy to adjust by setting "asyn_baud" on line 34 of the source.
Given the flash device gets blocked during erase/programming, the client(s) will have to use their own locks to ensure that no other COGs are accessing flash during erase operations etc. This driver won't manage all that, at least at this stage, though a layer above this driver could potentially assist. I'd need to think up a way to only allow one COG access to flash otherwise via the mailbox.
Here's the simplified API (which may still change):
I see ISSI specs 133 MHz at 3.3 V on OctaRAM, and Winbond has 166 MHz at 3.3 V for 64 Mb and is planning 200 MHz at 3.3 V for 128 Mb.
I think this capability will probably have to wait until a future driver version frees up additional space for supporting locks. Adding this would also introduce some extra checking overhead for every read, probably two instructions extra to both test for the locked flash bank case and then conditionally call out somewhere to handle that locked case to check if the calling client COG has the lock or not. Same goes for the flash word/burst/register writes. Normal HyperRAM memory writes thankfully won't be affected, as these use a separate path to deal with latency cases.
Until this or some similar protection scheme gets added, if you are erasing/writing to flash or otherwise accessing its special registers at configuration time, you'll need to co-ordinate this with any other COGs that may also be reading/writing the flash at the same time to prevent this situation. It's not ideal but is still probably workable. We can probably try to take locks inside the SPIN API for programming/reading registers but other PASM COGs or video COGs can still read the flash outside of this API via the mailbox. That's the issue.
How are programs going to like making requests to read the flash and having the possibility of the read failing at any time, either silently or, more preferably, with an explicit error code saying "FLASH IN USE" for example, because another COG happens to start erasing a block or accessing a flash register?
Ideally a given COG's flash request while a different COG has the lock could stall until the lock gets released, then its request could continue in the driver to return data, and this would need to include times during fragmented bursts. I could actually do this by returning to the polling loop without completing the request if I detect I need the lock yet it is already taken, but I can see a problem if a high priority COG requesting a flash read gets stalled as it would starve out the polling of lower COGs which might have the lock. Deadlock occurs if the low priority COG has the lock and can't continue during continuous polling of the incomplete high priority COG's request. This would totally hammer the polling loop too and waste P2 cycles. It would also be difficult to elevate the priority of a low priority COG on the fly to remedy this, as the polling code priority is essentially encoded into the polling sequence itself done at COG (re-)config time. I'd have to remove the blocked COG from the polling list and add it back when the lock is released. That's a fair bit of extra code to deal with all that.
When a COG wants to modify the flash, i.e. erase, program, or access one of its registers that needs an uninterruptible write sequence to be issued prior to the last access, it would first need to set a special protection flag in the per bank parameters for the flash bank being accessed, along with the ID of the only COG allowed to access the Flash. The SPIN2 API would handle this for the client COG. Internally the API may possibly also use a lock when modifying this flash protection state if there are multiple writers all vying for access to special Flash registers or trying to erase sectors etc, though I expect that should be an unlikely case and as such may not even be supported.
Any read/write requests that access a flash bank will have the protection flag checked by the HyperRAM driver. If this flag is clear the request can continue to proceed, but if the flag is set the code will call out for a further check.
This call will then try to match the allowed COG ID against the requesting COG whose mailbox is being serviced. If they match, the code returns to the calling read/write routine and proceeds to completion. If the COG ID does not match, the access will be prevented and the type of serviced COG will then be checked. If it was a strict priority serviced COG, the request will return with a failure code (eg. FLASH_BUSY), and the strict priority client will need to be written to deal with that eventuality if it could ever be expected to occur in the system. If the requesting COG was round-robin serviced, its request will stall until a different client releases the flash protection for the bank being accessed. Round robin COGs waiting for flash protection to be released will still consume some clocks retrying during this flash-protected state, but the good thing is that each stalled request attempt returns to polling fairly early in the sequence, and this will not stop the processing of all other COGs because the round robin polling order continues to advance on each polling iteration. They will all receive at least 1/n request opportunities, where n is the number of enabled RR COGs in the polling loop. Starvation or deadlock should not happen.
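The decision logic just described can be sketched roughly like this (a Python model with my own names and data layout; the real driver does this in PASM2 inside the request path):

```python
# Illustrative model of the per-bank flash protection check described above.
PROCEED, FLASH_BUSY, RETRY_LATER = "proceed", "FLASH_BUSY", "retry"

def check_flash_access(bank: dict, requesting_cog: int, strict_priority: bool) -> str:
    """Decide how a flash read/write request is handled for a bank."""
    if not bank["protected"]:
        return PROCEED                  # fast path: protection flag clear
    if requesting_cog == bank["owner_cog"]:
        return PROCEED                  # the locking COG may continue
    if strict_priority:
        return FLASH_BUSY               # real-time COGs fail immediately
    return RETRY_LATER                  # RR COGs return to polling and retry
```

Note the RR "retry" result simply sends the driver back to its polling loop, which is what keeps the other COGs being serviced.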
The expected timing penalties in clock cycles per request or per burst fragment if this type of flash protection is added to the current driver code are shown in the table below:
I think this is a reasonably simple way for the driver to protect the flash when one COG is programming it and other COGs can still try to access it. While protected flash banks get somewhat affected it would still allow accesses to other flash banks and HyperRAM without introducing much overhead. The register access overhead is not particularly significant given it's done very rarely, and, when a bank is protected, only one COG is going to be accessing it for erasure or a program burst which typically takes far longer to process anyway than the penalty involved.
I can probably fit the first part of this extra protection stuff into my main LUTRAM code paths with some recent shuffling. However, it would unfortunately still take another ~13 COG RAM longs for the extra COG test code, which I don't yet have room for unless I move some of its existing code into HUB exec or make other optimizations to free space, which adds four more clock cycles to each request (that's a huge change, more suited to a later release). If I just put this new flash protection code into HUB exec it will slow protected cases down proportionally for all cases > 4 clocks, but maybe even that is still okay.
UPDATE: Some sample PASM for this protection is now coded and it appears to fit! Still untested as yet. I was able to move my rarely called COG reconfiguration ATN handler into HUB exec, which freed some more longs, and I scrounged the others with more register sharing. This particular moved code only runs if you wish to dynamically add/remove COGs from the polling list after driver initialisation time. I'd prefer not to use very much HUB exec because HUB memory corruption from other COGs can then start to affect things in the memory driver; however, this particular code is not something that gets called for normal requests, so it should be pretty safe to move to HUB exec. As with my video driver, it's nice to keep these drivers operating as much as you can under HUB memory corruption situations to help you debug things for as long as possible. Obviously if the memory driver's mailbox area or video config regions get corrupted that could mess things up, but at least that is a smaller target in memory and is not actually executable code.
Status:
COGRAM use: 502 LONGs, 0 free
LUTRAM use: 512 LONGs, 0 free
HUBRAM use: 10 LONGs for hub exec
I already have some control methods that adjust the delay by operating frequency in this driver's API. At startup each device on the bus can be assigned its own delay profile which defines the frequency breakpoints at which the delay changes. I can also modify this profile and look this up later when instructed by the API, to decide what input delay to use for all following read operations to that device if things change. I am now wondering if I should try to include an optional temperature parameter which could be somehow used to interpolate delay information. If this temperature is not known or just passed in as 0 I guess it could just use the existing room temp default.
Right now these profiles are fairly simple and the related driver code is shown below (final default delays for HyperFlash are still TBD). One way to go is to have the existing setDelayProfile call just apply a whole new profile that is more specific to a particular temperature, and the driver can be instructed to use new delay values from that via my setDelayFrequency method if the frequency changes, though it would be nice to come up with something automatic that could somehow construct a new profile on the fly based on temperature. Maybe that is something for later...?
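To illustrate the idea (this is not the actual SPIN2 code, and the breakpoint values here are made up for the example), a delay profile is essentially a list of frequency breakpoints mapped to input delays:

```python
# Hypothetical delay profile: (max sysclock in MHz, input delay) pairs,
# ordered by frequency. The values are illustrative, not measured defaults.
EXAMPLE_PROFILE = [(96, 0), (196, 1), (290, 2), (340, 3)]

def delay_for_frequency(profile, sysclock_mhz):
    """Return the input delay for the first breakpoint at or above the
    operating frequency, falling back to the last entry's delay."""
    for max_mhz, delay in profile:
        if sysclock_mhz <= max_mhz:
            return delay
    return profile[-1][1]
```

A temperature-aware version might interpolate between two such profiles, which is roughly the open question above.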
Maybe a separate simple scanning tool could be developed to spit out the required delay profile. We can't really determine it at runtime in the final application because we need to do a PLL frequency scan to find breakpoints. Without the scan, we may find two delay values that work at the given P2 operating frequency but we'd not know which one of these could be marginal and which one is far better to use unless we scan below and above to find where it fails next.
When a COG locks the flash bank for its own use, e.g. to prepare to erase a sector, any RR-COG accessing this flash bank can be stalled until the lock is released. The real-time strict priority COGs instead fail with an error code indicating the flash is busy, as they can't be held up. I will also add the optional choice of having any RR COG also being able to fail immediately with this same error code if desired.
It does require some co-operation and it could obviously be bypassed by other PASM clients ignoring this convention and directly overwriting the lock for themselves by going through their mailboxes to issue raw register commands, but it can now at least protect the flash from being read while it shouldn't be (eg. during erase/program or in the midst of a register setup sequence, which could corrupt that transaction). In most cases only one COG should ever really need to modify the flash at a time, so this won't be an issue, though you could have several other readers sharing it. I'm happy enough for now with this capability vs what I had before.
This list looks fairly extensive now, but if you go with defaults you only need to call the init function once and then you can use it to read or write data from RAM or Flash at your mapped address right away; it's designed to be pretty easy that way. Also, the size of this object in your image is determined by the number of APIs you actually call, at least with how Fastspin builds things. I hope official SPIN2 has uncalled code removal too (can't recall).
Using Fastspin I built a minimal application with the simple init API method and the single reads/writes/bursts, and it was just under 11kB in size, of which some is Fastspin's own minimal application overhead of 1312 bytes plus around 4000 bytes of the PASM HyperRAM driver. If you include every single API with the HyperFlash & list stuff and all the custom tweaking APIs etc, it grows to just under 24kB, again including the same PASM and Fastspin overheads. It will be interesting to compare this amount against Chip's byte-coded SPIN2. I'll need to try that again when I can.
Unfortunately without calling all the extra features in the code like lists, fills, flash API, etc, and just calling the main init and the basic read/write APIs, it only seemed to save me 300 bytes or so which was just from removing the calls themselves, so it doesn't look like dead code methods get eliminated in official SPIN2 (at least in v34s).
So the penalty for using Fastspin is about 6800 bytes when everything is called, and this improves as less of the driver is called. For the minimal driver builds it appears that PNut will actually consume a bit more memory overhead, which is a shame given it is interpreted and its code size should be smaller... ideally, including dead code elimination would be best.
HyperFlash erase time seems to be somewhat dependent on contents, but a ballpark data point collected was 674 ms for erasing a 256 kB sector of random pre-programmed data. This is ~380 kB/s. If you attempt to erase an already erased sector it seems to know this and returns early in just 12 ms.
Update: the chip erase is also dependent on the contents. I was just able to erase a full chip (which was mostly already erased) in only 3s. So I probably need to first fill it completely (maybe with zeroes) and then retest this.
Update2: full chip erase after 32MB written took 47s to complete. So not as bad as the worst case. This was random data (not all zeroes).
Update3: turns out I had a bug with flash at the 16MB crossing point that affected the results which I just fixed, now a full chip erase of 32MB of an $AA pattern takes ~104s, and with 32MB of zeroes programmed it erased in 91 seconds. The full 32MB of programming took ~31 seconds, so it sustains around 1MB/s writing speed. These numbers are more in line with the data sheet values.
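As a sanity check on those figures, the throughput arithmetic works out like this (simple Python using the same numbers as above):

```python
# Sector erase: 256 kB of random data erased in 674 ms
sector_rate_kbs = 256 / 0.674   # ~380 kB/s, matching the estimate above

# Full-chip program: 32 MB written in ~31 s
program_rate_mbs = 32 / 31      # ~1 MB/s sustained programming speed
```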
This last problem happened in a special case with the clock pin configured as registered instead of unregistered, when filling bytes or writing bursts of odd byte lengths > 2 starting from even HyperRAM addresses, or writing even-sized burst lengths > 2 from odd addresses, where it would then corrupt the last word written. This never happened with the unregistered clock pin setting, which was the original default I used in most of my original testing (but is now changed because evanh found registered clocks allow the HyperRAM to overclock higher). It didn't happen with words or longs either.
Basically due to the extra timing lag on the clock pin being output generated in these special cases above, the chip select would get raised just before the final clock transition was output, thus not completing the transfer fully and corrupting the last byte. To fix it I was thankfully able to reorder the code and delay the raising of chip select to occur a couple of instructions later which meant the fix took up no space. Pretty important as I have no space anymore, and was getting worried about how it could be fixed until I knew what it was.
Since the fix, I've been running this with my video driver and even have a basic lightweight GUI type of test running now in 1080p 8bpp using a HyperRAM framebuffer. It's actually surprisingly snappy and I now know that this should be a very usable type of application of the HyperRAM.
Still to do:
- Document more
- Integrate better with Pnut SPIN2 for both video & HyperRAM drivers
- Develop some more examples of how to put this to good use
The original technique I used for supporting the control path was to examine the upper byte in the first mailbox long to decode the request:
Only banks 0-14 could be mapped to a device because bank 15 was the "control" bank.
The REQUEST mapping for regular memory banks was (and still will be) this:
However for the control bank 15, the REQUEST bits got mapped differently:
As mentioned I am going to change this now and instead of using bank 15, the driver will use the previously unused mailbox entries for the driver's COG ID to indicate and activate control operations, and for control alone. These control operations typically only need to be done once to initialise at startup, but can still be tweaked later for dynamic changes to experiment with higher performance, to change COG service priorities, and are also used to lock the flash during writes. Because the driver's mailbox is a common resource that might be shared by multiple COGs after startup I will probably also protect it with a lock in the SPIN2 API. I am going to be freeing up COGATN for future use.
The only thing that still needs to be special is the ability to start a request list. I will still need that operation to be done in-band on a per-COG mailbox basis (so not through the driver's own mailbox), and the way that will be done is to just start a write burst at external memory address $FFFFFF in bank 15. This is the same as having the first mailbox entry written as -1. The second (previously unused) mailbox long entry then becomes the start of the list in HUB RAM.
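In other words, a client flags a request list simply by writing -1 as the first mailbox long (bank 15, write-burst request, address $FFFFFF - all bits set). A hedged sketch of how the driver side could distinguish that case (my own naming, not the actual PASM2):

```python
LIST_START = 0xFFFFFFFF   # first mailbox long written as -1 (32-bit)

def is_list_request(mailbox_long0: int) -> bool:
    """True when the first mailbox long signals 'start a request list';
    the second mailbox long then holds the HUB address of the list."""
    return (mailbox_long0 & 0xFFFFFFFF) == LIST_START
```

Any write burst beginning even one address lower in bank 15 decodes as a normal request, which is why only the very top address carries this special meaning.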