HyperRAM driver for P2


Comments

  • Well the final address of the 2 long burst will always be 8 bytes after mailboxPtr1.

    Eg. A, B, C are contiguous longs in HUB
    A  <--- mailboxPtr1
    B  <--- mailboxPtr2 
    C
    

    First WRLONG writes longs B&C using mailboxPtr2 as a 2 long burst with SETQ #2-1
    Second WRLONG writes A using mailboxPtr1, immediately after the first WRLONG burst.
  • evanh Posts: 9,642
    edited 2020-06-28 - 00:45:56
    Oh, right, I'm talking longword addressing too, or more specifically how the accesses occur. If there is longword misalignment then that imposes a +1 sysclock as well.

    So, for an 8-cog prop2 and assuming not misaligned, modulo'd, address A is six sysclocks after address C. The single WRLONG to mailboxPtr1 will take 6 sysclocks to execute.

  • Ok good, 6 clocks is closer to 3 than 10 at least. Thanks evanh. This is the penalty that will (only) happen for PASM HyperRAM driver clients that use the fifo when issuing requests. In fact it's actually just 5 more because the optimal approach needs an extra clock for its 3-long mailbox write anyway.

    Definitely will be leaving the mailbox order as is now, the read side penalty after that proposed mailbox order change would be far worse than these 5 extra clocks during request setup, given single reads already take 9-16 clocks.

  • evanh Posts: 9,642
    edited 2020-06-28 - 01:17:22
    Scratching my head a little, the simplest equation would be for the basic modulo: delta = (nextStartAdr - priorEndAdr) % totalCogs

    If delta is below the minimum execution time of the instruction then add totalCogs to delta. The resulting delta is the instruction execution time.

    NOTE: "Adr" means longword based addressing, ie: byte addresses divided by four.
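
The modulo rule above can be sketched as a small Python model. This is only an illustration of the equation as posted, not code from the driver: the function name is made up, the default minimum of 3 clocks is an assumption taken from the "3 to 10" WRLONG range mentioned earlier in the thread, and the +1 misalignment penalty is not modelled.

```python
def hub_write_clocks(next_start_adr, prior_end_adr, total_cogs=8, min_clocks=3):
    """Estimate sysclocks for a hub write that follows a previous hub access.

    Addresses are longword addresses (byte address // 4), as in the post.
    min_clocks=3 is an assumed WRLONG minimum, not a value from the driver.
    """
    delta = (next_start_adr - prior_end_adr) % total_cogs
    if delta < min_clocks:
        # Too soon for this hub slot: wait one more full rotation.
        delta += total_cogs
    return delta


# Mailbox example from earlier in the thread: A, B, C are contiguous longs,
# the burst ends at C and the following WRLONG targets A.
A = 0x1000 // 4
C = A + 2
print(hub_write_clocks(next_start_adr=A, prior_end_adr=C))  # prints 6
```

With the mailbox layout discussed above this reproduces the 6 sysclock figure for the single WRLONG to mailboxPtr1 on an 8-cog P2.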

  • evanh Posts: 9,642
    edited 2020-06-28 - 07:55:45
    Roger,
    I've just been retesting with an old program, and fixing it up at the same time, that I called pin-ping.spin2. At the time, JMG reformatted my results into what he called a waterfall - https://forums.parallax.com/discussion/comment/1467859/#Comment_1467859

    With all our recent experience a couple of things stand out now:
    - The number of valid frequency bands in the ping testing is only two for registered and three for unregistered pins.

    - And, related, the bottom good band reaches up to sysclock of 150 MHz for unregistered, and up to 290 MHz for registered pins!

    Admittedly, this is not with burst data transfers but, as you can see, there is a huge difference between the transition bands of those two. My take is that with the faster v2 HyperRAMs we'll see much better outcomes.

    PS: Testing is with my glob-top revB chip.


    EDIT: Got it wrong for unregistered pins. :( It is three valid frequency bands. So, the main change is the frequencies are all scaled up heaps. Still the same conclusion really - Faster HyperRAM and short tracks on the board will have a big impact.

  • evanh Posts: 9,642
    edited 2020-06-28 - 07:29:23
    I also cut a couple of pins off the chip to see how much faster they go without the loading of the PCB:
    - Registered went roughly from 290 MHz to 360 MHz.
    - Unregistered went roughly from 150 MHz to 185 MHz.
    Note: That's the bottom band in both cases.

    EDIT: The non-globtop revB is about the same results.

    Here's the source code. It works on revA too, since that was its original target. When running it, you have to pipe/capture the output to a file. It outputs almost 480 kB.

    PS: I've got the baud set to 230400. Easy to adjust by setting "asyn_baud" on line 34 of the source.

  • Tubular recently indicated to me that v2 HyperRAM is now available in volume, I think, so with any luck we'll be able to see how much better v2 really works on the P2 once we get a board with that fitted and try it.
  • I've been working a little on the HyperFlash program/erase APIs today. The Flash device is complex and the initial driver API certainly won't cover all its features and capabilities. However low level access to internal flash registers is still possible in cases where further control is required, e.g. sector locks, secure silicon region, password protection, etc. I am making it so that flash erase and program can optionally timeout after some given time, or alternatively allow other non-flash operations to continue while the device is busy (set the wait timeout=0), and the client can poll the flash status itself.

    Given the flash device gets blocked during erase/programming, the client(s) will have to use their own locks to ensure that no other COGs are accessing flash during erase operations etc. This driver won't manage all that, at least at this stage, though a layer above this driver potentially could assist. I'd need to think up a way to only allow one COG access to flash otherwise via the mailbox.

    Here's the simplified API (which may still change):
    ' read status word
    PUB readFlashStatus(addr) : r
    
    ' erase 256kB sector
    PUB eraseFlashSector256k(addr, timeoutMs) : r  
    
    ' erase entire Flash device
    PUB eraseFlashDevice(addr, timeoutMs) : r
    
    ' program Flash data from HUB RAM buffer, must be word aligned in flash
    PUB programFlash(addr, srcHubAddr, byteCount, timeoutMs) : r
    
  • jmg Posts: 14,372
    rogloh wrote: »
    Tubular recently indicated to me that v2 HyperRAM is now available in volume, I think, so with any luck we'll be able to see how much better v2 really works on the P2 once we get a board with that fitted and try it.

    I see ISSI spec 133 MHz at 3v3 on OctaRAM, and Winbond have 166 MHz at 3v3 for 64Mb and are planning 200 MHz at 3v3 for 128Mb.
  • I had been hoping to figure out a simple way to support preventing simultaneous HyperFlash reads from the other COGs from messing up a COG preparing the special command sequences for HyperFlash functions like erase, unlock programming, or to access HyperFlash registers etc.

    I think this capability will probably have to wait until a future driver version frees up additional space for supporting locks. Adding this would also introduce some extra checking overhead for every read, probably two instructions extra to both test for the locked flash bank case and then conditionally call out somewhere to handle that locked case to check if the calling client COG has the lock or not. Same goes for the flash word/burst/register writes. Normal HyperRAM memory writes thankfully won't be affected, as these use a separate path to deal with latency cases.

    Until this or some similar protection scheme gets added, if you are erasing/writing to flash or otherwise accessing its special registers at configuration time, you'll need to co-ordinate this with any other COGs that may also be reading/writing the flash at the same time to prevent this situation. It's not ideal but is still probably workable. We can probably try to take locks inside the SPIN API for programming/reading registers but other PASM COGs or video COGs can still read the flash outside of this API via the mailbox. That's the issue.
  • rogloh Posts: 2,347
    edited 2020-07-03 - 08:36:01
    Even if locks are added, managing this gets complicated fast.

    How are programs going to like making requests to read the flash and have the possibility of the read failing at any time, either silently, or more preferably with an explicit error code saying "FLASH IN USE" for example because another COG happens to start to erase a block or access a flash register?

    Ideally a given COG's flash request while a different COG has the lock could stall until the lock gets released, then its request could continue in the driver to return data, and this would need to include times during fragmented bursts. I could actually do this by returning to the polling loop without completing the request if I detect I need the lock yet it is already taken, but I can see a problem if a high priority COG requesting a flash read gets stalled as it would starve out the polling of lower COGs which might have the lock. Deadlock occurs if the low priority COG has the lock and can't continue during continuous polling of the incomplete high priority COG's request. This would totally hammer the polling loop too and waste P2 cycles. It would also be difficult to elevate the priority of a low priority COG on the fly to remedy this, as the polling code priority is essentially encoded into the polling sequence itself done at COG (re-)config time. I'd have to remove the blocked COG from the polling list and add it back when the lock is released. That's a fair bit of extra code to deal with all that.
  • rogloh Posts: 2,347
    edited 2020-07-04 - 13:33:27
    I think I've come up with a lightweight (poor man's?) protection scheme for HyperFlash accesses. Here's how it could work:

    When a COG wants to modify the flash, i.e. erase, program, or access one of its registers that needs an uninterruptible write sequence to be issued prior to the last access, it would first need to set a special protection flag in the per bank parameters for the flash bank being accessed, along with the ID of the only COG allowed to access the Flash. The SPIN2 API would handle this for the client COG. Internally the API may possibly also use a lock when modifying this flash protection state if there are multiple writers all vying for access to special Flash registers or trying to erase sectors etc, though I expect that should be an unlikely case and as such may not even be supported.

    Any read/write requests that access a flash bank will have the protection flag checked by the HyperRAM driver. If this flag is clear the request can continue to proceed, but if the flag is set the code will call out for a further check.

    This call will then try to match the allowed COG ID against the requesting COG whose mailbox is being serviced. If they match then the code returns to the calling read/write routine and proceeds to completion. If the COG ID does not match, the access will be prevented and the type of serviced COG will then be checked. If it was a strict priority serviced COG, the request will return with a failure code (eg. FLASH_BUSY), and the strict priority client will need to be written to deal with that eventuality if it could ever be expected to occur in the system. If the requesting COG was round-robin serviced, its request will stall until a different client releases the flash protection for the bank being accessed. Round robin COGs waiting for flash protection to be released will still consume some clocks to retry during this flash protected state but the good thing is that each stalled request attempt will return to polling fairly early in the sequence, and this will not stop the processing of all other COGs because the round robin polling order will continue to advance on each polling iteration. They will all receive at least 1/n request opportunities where n is the number of enabled RR COGs in the polling loop. Starvation or deadlock should not happen.
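
The decision flow described above can be summarised in a few lines of Python. This is a rough model for discussion only, not the PASM in the driver; the function name, the `Outcome` values and the parameter names are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PROCEED = "proceed"        # request runs to completion
    FLASH_BUSY = "flash_busy"  # fail immediately with an error code
    RETRY_LATER = "retry"      # return to polling, request stays pending


def check_flash_access(protected, owner_cog, requesting_cog, strict_priority):
    """Decide what happens to one request aimed at a flash bank."""
    if not protected:
        return Outcome.PROCEED        # fast path: protection flag is clear
    if requesting_cog == owner_cog:
        return Outcome.PROCEED        # the COG that set the protection
    if strict_priority:
        return Outcome.FLASH_BUSY     # real-time clients are never stalled
    return Outcome.RETRY_LATER        # round-robin clients wait for release


result = check_flash_access(True, owner_cog=2, requesting_cog=5, strict_priority=False)
print(result.name)  # prints RETRY_LATER
```

Only the first test sits on the common path, which is why the unprotected case is expected to cost so little extra per request.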

    The expected timing penalties in clock cycles per request or per burst fragment if this type of flash protection is added to the current driver code are shown in the table below:

    chart.png

    I think this is a reasonably simple way for the driver to protect the flash when one COG is programming it and other COGs can still try to access it. While protected flash banks get somewhat affected it would still allow accesses to other flash banks and HyperRAM without introducing much overhead. The register access overhead is not particularly significant given it's done very rarely, and, when a bank is protected, only one COG is going to be accessing it for erasure or a program burst which typically takes far longer to process anyway than the penalty involved.

    I can probably fit the first part of this extra protection stuff into my main LUTRAM code paths with some recent shuffling. However it would unfortunately still also take another ~13 COG RAM longs for the extra COG test code which I don't yet have room for unless I move some of its existing code into HUB exec perhaps or make other optimizations for freeing space which adds four more clock cycles to each request (that's a huge change, more suited to a later release). If I just put this new flash protection code into HUB exec it will slow protected cases down proportionally for all cases > 4 clocks, but maybe even that is still okay.

    UPDATE: Some sample PASM for this protection is now coded and it appears to fit! Still untested as yet. I was able to move my rarely called COG reconfiguration ATN handler into HUB exec, which freed some more longs and I scrounged the others with more register sharing. This particular moved code only runs if you wish to dynamically add/remove new COGs from the polling list after driver initialisation time. I'd prefer to not use very much HUB exec because HUB memory corruption from other COGs can then start to affect things in the memory driver, however this particular code is not something that gets called for normal requests so it should be pretty safe to move to hub exec. As with my video driver it's nice to keep these drivers operating as much as you can under HUB memory corruption situations to help you debug things for as long as possible. Obviously if the memory driver's mailbox area or video config regions get corrupted that could mess things up, but at least that is a smaller target in memory and is not actually executable code.

    Status:
    COGRAM use: 502 LONGs, 0 free
    LUTRAM use: 512 LONGs, 0 free
    HUBRAM use: 10 LONGs for hub exec
  • rogloh Posts: 2,347
    edited 2020-07-06 - 07:03:51
    @evanh, do you think we can use a simple interpolation of the needed input delay for HyperRAM and HyperFlash based on temperature as well as frequency?

    I already have some control methods that adjust the delay by operating frequency in this driver's API. At startup each device on the bus can be assigned its own delay profile which defines the frequency breakpoints at which the delay changes. I can also modify this profile and look this up later when instructed by the API, to decide what input delay to use for all following read operations to that device if things change. I am now wondering if I should try to include an optional temperature parameter which could be somehow used to interpolate delay information. If this temperature is not known or just passed in as 0 I guess it could just use the existing room temp default.

    Right now these profiles are fairly simple and the related driver code is shown below (final default delays for HyperFlash are still TBD). One way to go is to have the existing setDelayProfile call just apply a whole new profile that is more specific to a particular temperature and the driver can be instructed to use new delay values from that using my setDelayFrequency method if the frequency changes, though it would be sort of nice to come up with something that is automatic, or could somehow construct a new profile on the fly based on temperature. Maybe that is something for later...?
    ' associate a custom delay profile for a device, no change to actual driver input delay
    PUB setDelayProfile(addr, profile) : r | bus, bank
        bus := addrMap[addr >> 24]
        bank := (addr >> 24) & $f
        if bus +> MAX_INSTANCES - 1
            return ERR_INVALID
        profiles[bus * NUMBANKS + bank] := profile
        return 0
    
    ' if the frequency changes at runtime this API can be used to adjust the input delay timing for a device
    ' TODO: tempK is in this API for the future (if temperature compensation can be applied, 0 to ignore)
    PUB setDelayFrequency(addr, freq, tempK) : r | bus, bank
        bus := addrMap[addr >> 24]
        bank := (addr >> 24) & $f
        if bus +> MAX_INSTANCES - 1
            return ERR_INVALID
        return setDelay(addr, lookupInputDelay(freq, profiles[bus * NUMBANKS + bank]))
    
    ' looks up input delay to use at a particular frequency from a profile
    PRI lookupInputDelay(freq, profile) : delay 
        delay := long[profile][0]
        repeat while long[profile][1] 
            if freq +< long[profile][1] 
                quit
            profile += 4
            delay++
    
    'setDelay
    ' sets the delay value used in the driver for the memory device mapped to the address
    ' addr - address of the Hyper device to configure
    ' delay - nibble value passed is (delayClocks * 2) + (registeredDataBusFlag)
    ' returns 0 for success or negative error code
    PUB setDelay(addr, delay) : r 
        if delay +> 15
            return ERR_INVALID
        r := modifyBankParams(addr, $FFFF0FFF, delay << 12)
     
    'Default delay profiles used for HyperFlash and HyperRAM on P2-EVAL HyperRAM breakout board 
    'operating at room temp. This can be tweaked or others added for different temperatures.
    'These delay profiles can be assigned to each configured device at address mapping time.
    'The actual operating input delay can also be adjusted on the fly per bank if the variation 
    'of delay with temperature is already determined and the temperature is known/measurable.
    
    HyperRamDelays   long 6,88_000000,120_000000,180_000000,225_000000,270_000000,0
    HyperFlashDelays long 5,88_000000,120_000000,180_000000,225_000000,270_000000,0
    
    'The profile format begins with the initial delay value, followed by frequencies at which the
    'delay is sequentially increased until either it falls below the next frequency, or the list 
    'terminates with a zero.  Frequencies must be stored in increasing order.
    
    ' e.g. using data above
    '   if            0 <= freq <  88000000 Hz, the delay compensation value is 6,
    '   if     88000000 <= freq < 120000000 Hz, the delay compensation value is 7,
    '   if    120000000 <= freq < 180000000 Hz, the delay compensation value is 8,
    '                   ...etc...
    '   if    270000000 <= freq               , the delay compensation value is 11 
    '
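
For anyone wanting to check a profile off-target, the lookup can be mirrored in Python. This is just a sketch that follows the same walk as lookupInputDelay above, using the HyperRamDelays values from the post; the Python names are my own and not part of the driver.

```python
# Same values as the HyperRamDelays table: initial delay, breakpoints, zero terminator.
HYPER_RAM_DELAYS = [6, 88_000_000, 120_000_000, 180_000_000,
                    225_000_000, 270_000_000, 0]


def lookup_input_delay(freq, profile):
    """Return the input delay for freq using a zero-terminated profile list."""
    delay = profile[0]
    for limit in profile[1:]:
        # Stop at the terminator or at the first breakpoint above freq.
        if limit == 0 or freq < limit:
            break
        delay += 1
    return delay


for f in (50_000_000, 100_000_000, 200_000_000, 300_000_000):
    print(f, lookup_input_delay(f, HYPER_RAM_DELAYS))
# prints delays 6, 7, 9 and 11 for the four frequencies
```

A frequency exactly on a breakpoint takes the higher delay, which matches the `<=` on the left side of the ranges in the comment block.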
    
  • Yes, although the narrowness of the higher bands makes me wary of using the accessory board in this way. I see this approach used more for a prop2 board with dedicated HR tucked in close.

  • rogloh Posts: 2,347
    edited 2020-07-07 - 01:35:20
    Yeah I expect it's probably not guaranteed to work in all cases. I think the first driver release might just aim for room temp defaults only as I have above, yet still provide an ability for people to adjust this profile themselves in advanced cases, where the breakpoints would need to be determined by them for their own boards/temperatures.

    Maybe a separate simple scanning tool could be developed to spit out the required delay profile. We can't really determine it at runtime in the final application because we need to do a PLL frequency scan to find breakpoints. Without the scan, we may find two delay values that work at the given P2 operating frequency but we'd not know which one of these could be marginal and which one is far better to use unless we scan below and above to find where it fails next.
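
As a thought experiment, here is how such a scanning tool might turn its results into a profile in the format used above. Everything here is hypothetical: no such tool exists in the thread, the `scan` input (frequency mapped to the set of delay values that passed a read test) is an assumed data shape, and the "stay farthest from the band edges" rule is just one crude way to avoid the marginal delay.

```python
def build_delay_profile(scan):
    """scan maps frequency in Hz -> set of delay values that passed there."""
    # Working band (lowest and highest passing frequency) for each delay value.
    bands = {}
    for freq, delays in scan.items():
        for d in delays:
            lo, hi = bands.get(d, (freq, freq))
            bands[d] = (min(lo, freq), max(hi, freq))

    def margin(freq, d):
        lo, hi = bands[d]
        return min(freq - lo, hi - freq)

    profile = []
    current = None
    for freq in sorted(scan):
        if not scan[freq]:
            raise ValueError(f"no working delay at {freq} Hz")
        # Prefer the delay whose band edges are farthest from this frequency;
        # ties go to the lower delay value.
        best = max(sorted(scan[freq]), key=lambda d: margin(freq, d))
        if current is None:
            profile.append(best)      # initial delay value
        elif best == current + 1:
            profile.append(freq)      # breakpoint: delay steps up by one here
        elif best != current:
            raise ValueError("profile format needs delays to rise in steps of one")
        current = best
    profile.append(0)                 # zero terminator
    return profile


# Made-up scan results, purely to show the shape of the data.
example_scan = {
    80_000_000: {6},
    100_000_000: {6, 7},
    120_000_000: {6, 7},
    140_000_000: {7},
    160_000_000: {7, 8},
    180_000_000: {8},
}
print(build_delay_profile(example_scan))
```

The margin is only meaningful if the scan extends past both ends of each band, which is the same point made above about needing to scan below and above the operating frequency.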
  • Just tested out my COG access protection scheme for HyperFlash, so far it seems to be working nicely. :smile:

    When a COG locks the flash bank for its own use, e.g. to prepare to erase a sector, any RR-COG accessing this flash bank can be stalled until the lock is released. The real-time strict priority COGs instead fail with an error code indicating the flash is busy, as they can't be held up. I will also add the optional choice of having any RR COG also being able to fail immediately with this same error code if desired.

    It does require some co-operation and it could obviously be bypassed by other PASM clients ignoring this convention and directly overwriting the lock for themselves instead by going through their mailboxes to issue raw register commands, but it now can at least protect the flash from being read while it shouldn't be (eg. during erase/program or in the midst of a register setup sequence, which could corrupt that transaction). In most cases, only one COG should ever really need to modify the flash at a time so this won't be an issue, though you could have several other readers sharing it. I'm happy enough for now with this capability vs what I had before this.
  • Nice. I've heard it's tricky to get that stuff right. One of the things taught for Computer Science.

  • rogloh Posts: 2,347
    edited 2020-07-06 - 11:30:41
    Here's the current API for the driver. Shouldn't need to change too much from here with any luck. I really don't want any more features as it gets progressively harder to fit them in. Just any final bug fixes I find from here.

    This list looks fairly extensive now but if you go with defaults you only need to call the init function once, then you can use it to read or write data from RAM or Flash at your mapped address right away; it's designed to be pretty easy that way. Also the size of this object in your image is determined by the number of APIs you actually call, at least with how Fastspin builds things. I hope official SPIN2 has uncalled code removal too (can't recall).

    Using Fastspin I built a minimal application with the simple init API method and the single reads/writes/bursts and it was just under 11kB in size, which includes Fastspin's own minimal application overhead of 1312 bytes plus around 4000 bytes for the PASM HyperRAM driver. If you include every single API with HyperFlash & list stuff and all custom tweaking APIs etc it grows to just under 24kB, again including the same PASM and Fastspin overheads. It will be interesting to compare this amount against Chip's byte coded SPIN2. I'll need to try that again when I can.
    'simplified driver startup with defaults applicable to the P2-EVAL HyperRAM/HyperFlash module 
    PUB initHyperDriver(basePin, ramAddr, flashAddr, flags, freq) : bus
      OR
    PUB initHyperDriverCog(basePin, ramAddr, flashAddr, flags, freq, cog) : bus 
    
    'driver startup for other custom setups
    PUB mapHyperRam(addr, size, datapin, cspin, clkpin, rwdspin, resetpin, burst, delayProfile) : bus
    PUB mapHyperFlash(addr, size, datapin, cspin, clkpin, rwdspin, resetpin, burst, delayProfile) : bus
    PUB start(bus, flags, freq) : driverCog
      OR
    PUB startCog(bus, flags, freq, cog) : driverCog 
    
    'memory/reg reads
    PUB readByte(addr) : r 
    PUB readWord(addr) : r 
    PUB readLong(addr) : r 
    PUB read(dstHubAddr, srcAddr, count) : r 
    PUB readReg(addr, regaddr) : r
    PUB readRaw(addr, addrhi_16, addrlo_32) : r 
    
    'memory/reg writes
    PUB writeByte(addr, data) : r 
    PUB writeWord(addr, data) : r 
    PUB writeLong(addr, data) : r 
    PUB write(srcHubAddr, dstAddr, count) : r 
    PUB writeReg(addr, regaddr, value) : r
    PUB writeRaw(addr, addrhi_16, addrlo_32, value) : r 
    
    'memory read-modify-writes
    PUB readModifyByte(addr, data, mask) : r 
    PUB readModifyWord(addr, data, mask) : r 
    PUB readModifyLong(addr, data, mask) : r 
    
    'fills/copies/list oriented transfers
    PUB readList(dstHubAddr, srcAddr, count, listPtr) : r 
    PUB writeList(srcHubAddr, dstAddr, count, listPtr) : r 
    PUB fillBytes(addr, pattern, count, listPtr) : r 
    PUB fillWords(addr, pattern, count, listPtr) : r 
    PUB fillLongs(addr, pattern, count, listPtr) : r 
    PUB copyBuf(dstAddr, srcAddr, totalBytes, hubBuffer, bufSize, listPtr) : r 
    PUB execList(bus, listptr) : r 
    
    'graphics specific
    PUB gfxCopyImage(dstAddr, dstPitch, srcAddr, srcPitch, byteWidth, height, hubBuf, listPtr) : r 
    PUB gfxReadImage(dstHubAddr, dstPitch, srcAddr, srcPitch, byteWidth, height, listPtr) : r 
    PUB gfxWriteImage(srcHubAddr, srcPitch, dstAddr, dstPitch, byteWidth, height, listPtr) : r 
    PUB gfxFillBytes(dstAddr, dstPitch, width, height, pattern, listPtr) : r 
    PUB gfxFillWords(dstAddr, dstPitch, width, height, pattern, listPtr) : r 
    PUB gfxFillLongs(dstAddr, dstPitch, width, height, pattern, listPtr) : r 
    
    'HyperFlash specific
    PUB eraseFlash(addr, flags) : r 
    PUB pollEraseStatus(addr) : r
    PUB programFlash(addr, srcHubAddr, byteCount) : r 
    PUB programFlashByte(addr, data) : r 
    PUB programFlashWord(addr, data) : r 
    PUB programFlashLong(addr, data) : r 
    PUB readFlashStatus(addr) : r 
    PUB clearFlashStatus(addr) : r
    PUB readFlashInfo(addr, wordoffset) : r 
    PUB readFlashICR(addr) : r 
    PUB readFlashISR(addr) : r 
    PUB readFlashVCR(addr) : r 
    PUB readFlashNVCR(addr) : r 
    PUB writeFlashICR(addr, data) : r
    PUB writeFlashISR(addr, data) : r
    PUB writeFlashVCR(addr, data) : r 
    PUB setFlashLatency(addr, latency) : r 
    
    'HyperRAM specific
    PUB readRamIR(addr, ir_num, mcpdie_num) : r 
    PUB readRamCR(addr, cr_num, mcpdie_num) : r 
    PUB writeRamCR(addr, cr_num, mcpdie_num, value) : r 
    PUB setRamLatency(addr, latency) : r 
    
    'misc driver config APIs
    PUB setDriverLatency(addr, latency) : r 
    PUB getDriverLatency(addr) : r 
    PUB setBurst(addr, burst) : r 
    PUB getBurst(addr) : burst
    PUB setDelay(addr, delay) : r 
    PUB getDelay(addr) : delay
    PUB setDelayProfile(addr, profile) : r
    PUB setDelayFrequency(addr, freq, tempK) : r 
    PUB lockFlashAccess(addr) : r
    PUB unlockFlashAccess(addr) : r
    PUB getFlashLockedCog(addr) : r
    PUB getMaxBurst(frequency, cs_interval, latency) : clocks
    PUB getMailboxAddr(bus, cog) : addr 
    PUB getDriverCogID(bus) : cog
    PUB setupCogParams(cogmask, bus, burst, prioFlags) : cog 
    PUB removeCogs(cogmask, bus) : r
    PUB getLastError(bus) : r
    PUB shutdown(bus) : i
    