HyperRAM driver for P2

(Reserved for posting driver)

Comments

  • Recently I've been making progress on a HyperRAM driver I've been putting together based on this original proposal/spec, linked below. The driver code will be posted in the top post of this thread when it is ready for release which is hopefully fairly soon.

    https://forums.parallax.com/discussion/170645/proposed-external-hyperram-memory-interface-suitable-for-video-drivers-with-other-cogs

    Originally I started with some HyperRAM access code from ozpropdev, extended it, and then wrapped it in the rest of the infrastructure required to implement the features listed below. It's been working for a while with my video driver and other COGs sharing the HyperRAM, but in an unrefined state, not really in a form suitable for release. At this point I've only worked with the HyperRAM module on a P2-EVAL, so the signal timing of other systems may vary and require further changes, or potentially not work at all at certain speeds/temperatures etc. I've not tested the HyperFlash device yet, which is very complex to write to compared to RAM, but this driver should in theory be able to access it as well.


    In any case here is what it should do when done...any further suggestions now are still welcome as the code is still in late development.
    P2 HyperRAM External Memory Driver
    ==================================
    
    Features:
    * supports Parallax HyperRAM/HyperFlash P2-EVAL expansion board 
    * is also intended to support other HyperRAM implementations with the P2
    * supports multiple chip selected devices on a shared Hyper data bus
    * configurable control pin allocation & reset for multiple devices
    * data transfers made at sysclk/1 or sysclk/2 byte rates (sysclk/2 can be enforced)
    * transfer rate and read delay configured by operating frequency range
    * multiple COGs can all share the external memory together
    * nominated COGs can have priority access to memory, eg. video & audio COG drivers
    * all other non-priority COGs are round-robin polled for fairness of requests
    * COGs not needing to access this memory can be excluded to reduce polling overhead
    * COG polling loop code is constructed automatically to optimize performance
    * selectable byte/word/long or larger burst transfers to/from external memory
    * longer round-robin COG bursts automatically divided up to guarantee stable video operation
    * a simple mailbox interface is used for COGs requesting external memory service
    * serviced COGs can be notified of results via the mailbox and optional COGATN
    * supports up to 15 (14?) devices on the same bus
    * banks can be arranged in 16MB memory blocks or higher
    * flexible bank to device mapping allowing variable device sizes, eg. 32MB flash
    * inbuilt support for reporting errors, eg. device busy / locked or invalid bank etc
    


    Some possible future areas I might investigate or possibly add for a later release:
    * support a list of multiple requests in single mailbox access 
       (this may be very handy for supporting multiple audio input channels, eg wavetable synthesis)
    * dynamic addition of other RR COGs to the polling list being serviced after driver startup
    * dynamic control of burst size for more bandwidth fairness amongst RR COGs?
    * enabling larger transfer bursts when video is known to be inactive
    * support for other similar memory devices like octaRAM?
    * multiple HyperRAM driver instances could share a common SW interface and select driver+bank based on address?
    * locks for RR COGs if/when their bursts are divided?
    * wider memory / parallel HyperRAM?
    
  • @rogloh,

    what I would like to use the Hyper-Ram for (besides Video) would be the ability to bank-switch some HUB area out and another one in. Maybe even more than one area.

    I do have a P2-Eval A and one B, have two Hyper Ram modules and have finished my taxes, so can start to download your Video code and will try to make sense out of that first.

    Then I might understand better what exactly you are talking about, but moving Video Ram out of the HUB ram into the Hyper Ram is a wonderful start. I think 512KB is still too small, so something like EMS or bank-switching à la CP/M would be nice, or even misusing the overlay mode of GCC (somebody tried it with PropGCC and it worked).

    Sadly you said that non sequential access should be avoided, else I would even try to figure out some LMM running out of a LARGE address space, CATALINA does it for P1 and EEPROMS. Maybe @RossH chimes in again, but Hyper Ram support in Catalina for P2 might not be so complicated.

    How about full blown paging? :) ...any further suggestions now are still welcome

    will play tomorrow, thank you for doing this,

    Mike
  • Yes I think bank switching out different HUB regions to HyperRAM is quite possible with block transfers. This would be done in a software layer that makes calls to the HyperRAM driver to do the necessary block copies between HUB & external memory. It can be done with an API for the requestor COG that sets up the mailbox request(s) for this appropriately.

    Non-sequential access is still possible with HyperRAM, it is just that the performance can suffer in the more demanding situations. Without video competing for bandwidth, at best you can probably only do in the order of 1 million individual accesses per second if you run the P2 around 250MHz or so compared to probably 20x this number for HUB RAM at the same clock speed with individual random access, however once you start transferring data it can be fast, up to 200MB/s at full rated speed on the P2-EVAL's HyperRAM board. Newer memory devices may go even higher with the P2.
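
    To put rough numbers on that trade-off, here is a back-of-envelope model (not a measurement). It assumes every request pays a fixed setup cost of about 1 us, matching the "1 million individual accesses per second" estimate, and then streams at the 200MB/s peak rate. Both constants are just the estimates quoted above.

```python
# Rough model only: each request pays a fixed setup cost, then streams at
# the peak rate. Both constants are assumptions taken from the estimates above.
SETUP_US = 1.0           # ~1 million individual accesses per second
PEAK_MB_PER_S = 200.0    # peak streaming rate quoted for the P2-EVAL board


def effective_mb_per_s(burst_bytes):
    """Average throughput for back-to-back requests of burst_bytes each."""
    stream_us = burst_bytes / PEAK_MB_PER_S   # 1 MB/s is 1 byte per microsecond
    return burst_bytes / (SETUP_US + stream_us)


for size in (1, 4, 64, 256, 1024):
    print(f"{size:5d} bytes/request -> {effective_mb_per_s(size):6.1f} MB/s")
```

    Under these assumptions single-byte accesses stay below 1MB/s, while bursts of a few hundred bytes already recover more than half of the peak rate.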

    Because of the latency some caching would make sense if you wanted to execute code from the memory brought in from HyperRAM. It might be possible to add some management of caching somewhere, whether partly in this COG or contained within some other caching COG acting on behalf of its requestors or entirely handled within the VM of the COG executing this memory when it is only just going to be a single COG requiring this.

    I don't think we'll ever get to full paging...we'd need a mechanism to know when to pull new pages into memory and swap out the old ones, and we have nothing like that built into hardware. So it's probably more of an overlay model or something like the XMM approach for P1 if we can put up with the (variable) latency - some slower or legacy P1 applications might be able to use it if they already worked fine that way for the P1.
  • By the way, the reason I mentioned this:

    * supports up to 15 (14?) devices on the same bus

    was that I am thinking I'd like to reserve bank "0" as a way to differentiate hub addresses vs external addresses. Internal HUB RAM addresses range from 0-2^19-1 with upper bits zeroes. This would then mean the bank is zero and we could make use of this to indicate the difference between internal vs external memory. I have 4 bits to indicate the bank. Right now all 1's (%1111) as the bank address is also potentially reserved for configuration, though I could possibly share the 0000 value for this too. Another way to go is simply using the top bit 31 which becomes part of the mailbox service request and would be overwritten anyway. So this is still up in the air.
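
    A minimal sketch of the "bank 0 means hub" idea, assuming the 4 bank bits sit directly above a 24-bit (16MB) offset, i.e. in bits 27..24. That bit position is my assumption for illustration; the final request format is still undecided.

```python
BANK_SHIFT = 24                        # assumed: 16 MB banks, low 24 bits are the offset
BANK_MASK = 0xF                        # 4 bits of bank number
OFFSET_MASK = (1 << BANK_SHIFT) - 1


def decode(addr):
    """Split an address into (bank, offset). Bank 0 is treated as hub RAM."""
    bank = (addr >> BANK_SHIFT) & BANK_MASK
    return bank, addr & OFFSET_MASK


def is_hub(addr):
    """Hub addresses (0 .. 2^19 - 1) have all upper bits zero, so bank == 0."""
    return decode(addr)[0] == 0
```

    With this layout `is_hub(0x7FFFF)` is true for the top of hub RAM, while `decode(0x0F123456)` gives bank 15 with offset `0x123456`.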
  • rogloh Posts: 1,999
    edited 2020-02-08 - 06:01:06
    Another thing that could become handy to add to this driver (if it fits) is support for reading from a SPI based flash memory bank so that multiple COGs can share the inbuilt flash memory using the exact same type of external memory interface. Then things like audio drivers can playback from external SPI flash just as well as from HyperRAM etc, and application COGs could still read data code from there in parallel. That could become useful...

    Of course for the video applications controlling SPI flash from a different dedicated driver is better as it allows more bandwidth and video won't compete with it. We can always have two instances of this driver in such a case, one with a single flash device and the other for HyperRAM devices (at the expense of an extra driver COG).

    Update: The good thing is that the way I have implemented the service request+bank lookup and jump code is that in theory it could resolve to Hub Exec code addresses too, so in future additional or more complex external memory types that don't fit inside the COG could still be served by Hub Exec driver code extensions to this driver. These could read from additional mailbox data elements too if needed. This may allow other "plug-in" memory types, or other things that could vary dynamically like mounting filesystems etc. With 7 bits of service table lookup we have effectively 128 different independent memory access "services" available all working on 16MB address ranges by default, or less services with larger memory banks/more mailbox data slots, and many to one "wildcard bit" service mapping is supported to widen the address range. It's a fairly flexible architecture right now. The main thing to consider is total execution time to service a request needs to be kept short (eg. < 1us) or can be broken up into smaller pieces so as to not affect video. SD card accesses would not suit this model for example, but flash memory could.
  • Hi rogloh

    A simple way to have many 3.3 V HyperBus memory devices sharing the same 11 or 12 interface signals (CK/CS#(selectively gated)/DQ{7:0}/RESET#(selectively gated, if ever needed)/RWDS), while consuming no other P2 pins, can be made out of dependable and fast little logic OR-gates and flip-flops/latches.

    The above approach would leverage the fact that each HyperBus device floats (HiZ) its DQ{7:0} and RWDS shortly after CS# goes High (respecting both (tCSH + tDSZ) and (tCSH + tOZ), which gives a total of 7 ns after the last CK# = Low at the end of the current/last transaction), making those lines available to be driven by the HyperBus controller (P2) and passed (e.g., thru transparent latches), needing no other control signal than the master (P2) CS#-driving pin to go High.

    The logic sequence would be (3.3 V devices):

    - ensure master CK (P2 pin) is driven Low (at the completion of current HyperBus transaction);
    - drive master CS# (P2 pin) High;
    - ensure at least 7 ns (more is better, to account for capacitance / propagation delays; e.g., during this time a SN74LVC1G32 2-input OR-gate would fully switch its output from Low to High (< 4 ns), deselecting the currently-selected device);
    - drive the selection bits at DQ{7:0} (High = Deselected, Low = Selected);
    - ensure at least 8 ns for the selected (Low) signal(s) to propagate thru the transparent latch(es) (4 ns, e.g., SN74LVC1G373) and complete the internal control path of each OR-gate (< 4 ns);
    - If a new transaction is to begin, drive the master (P2 pin) CS# Low, definitely latching the selection, so you can use DQ{7:0} to prepare for the next CA phase, for the next command to be sent to the HyperThings.

    - If the ability of controlling each HR RESET# is desirable, repeat the circuit, now using the master RESET# (P2 pin) as the controlling signal.
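
    The sequence above can be modelled in a few lines. This is only a behavioural sketch of how I read the description, with no timing: one transparent latch bit per device with its latch enable tied to master CS# (transparent while High, holding while Low), and one OR gate per device combining master CS# with the latched bit. The class and method names are made up for illustration.

```python
class CsDistributor:
    """Behavioural model: transparent latch + one OR gate per device."""

    def __init__(self, n_devices=8):
        self.n = n_devices
        self.latched = (1 << n_devices) - 1   # all High = nothing selected
        self.master_cs = 1                    # master CS# idles High

    def set_inputs(self, master_cs, dq):
        # The latch follows DQ while master CS# is High and holds once it is Low,
        # so DQ is free to carry the CA bytes during the transaction.
        if master_cs == 1:
            self.latched = dq & ((1 << self.n) - 1)
        self.master_cs = master_cs

    def device_cs(self):
        # Each device CS# = master CS# OR its latched select bit (active Low).
        return [self.master_cs | ((self.latched >> i) & 1) for i in range(self.n)]


bus = CsDistributor()
bus.set_inputs(1, 0b11111011)   # master CS# High: present select bits, device 2 Low
bus.set_inputs(0, 0xA0)         # master CS# Low: selection is held, DQ now carries CA
print(bus.device_cs())
```

    Driving more than one select bit Low before dropping master CS# selects several devices at once, which is the broadcast-write case mentioned further down.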

    The size of those Little Logic devices is so small (down to 1.5 mm X 1 mm in some cases) that one can confuse them with decoupling caps. Also, they are cheap enough, so they don't present a cost problem either.

    For write-intensive tasks, having the ability of selecting multiple devices, at the same time, can be used in advantage to, e.g., clear/pre-load many buffers, in parallel, at no extra cost, from P2 pin-usage point of view. You would need to use a unique CK/RWDS pair per independent DQ{7:0} bus if you intend to broadcast (or readback) to/from many other P2.

    Using two or more streamers could enable you to broadcast any portion(s) of a shared Hub image to any number of receiving P2 (limited to how many DQ{7:0} buses you are keen to reserve at the "master" Hub-holding P2). And each new bus opens the possibility of having eight more HyperBus devices attached to it.

    I understand that DQs, CK and RWDS would soon be compromised, due to routing and capacitive effects, but many-layer PCBs and Bottom/Top-layer assembling of SMDs are not monsters; they only have to be carefully done, in order to work properly.

    Henrique

  • rogloh Posts: 1,999
    edited 2020-02-08 - 23:05:52
    Thanks Henrique. Sounds interesting, I'd like to see a schematic to follow this properly. It seems like some way to condense down multiple chip select (CS) pins with a latch. A picture is worth a thousand words. My code assumes independent control of all 3 pins, however it can also share the same pin for CLK and potentially RWDS when multiple devices are set up with the same pin number, so probably CS is the only pin that currently needs to scale as you add another device. If you have a clever way to demultiplex the CS we can look at it at some point as an option perhaps. Actually I don't really expect a setup that would use all 15 banks on the same DQ (it's probably unrealistic) but I guess someone might want to try. As you already mention, the issue is that the loading on the shared signals could be compromised by a high fanout.

    At this early stage for RESET pins I just nominate a couple of parameters, the minimum low time and the minimum high time before any CS is possible on the device. At driver initialisation time in parallel I then strobe all resets low and wait (for the largest min low time), then strobe all high and wait (for the largest min high time). So with this basic scheme the device RESETs (if enabled) can already share the same pin if they wish, or have their own independent pin though in either case they will still all be pulsed the same at driver startup.
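
    In other words the startup reset only needs the worst-case pair of times across all devices. A tiny sketch of that selection, where the tuple layout and the microsecond values in the example are hypothetical and not taken from any datasheet:

```python
def reset_schedule(devices):
    """devices: list of (min_low_us, min_high_us), one per device with reset enabled.

    Returns how long to hold every reset Low, then how long to wait with all
    resets High before any chip select is allowed.
    """
    if not devices:
        return (0, 0)            # no resets enabled, nothing to wait for
    low = max(d[0] for d in devices)
    high = max(d[1] for d in devices)
    return (low, high)


# Hypothetical numbers, just to show the selection
print(reset_schedule([(1, 150), (5, 100)]))
```

    Because every device sees the longest Low and High times, it makes no difference whether the resets share one pin or use separate pins.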
  • Yanomani Posts: 958
    edited 2020-02-09 - 02:20:50
    Here it is (tks to Scheme-it)

    CS_Distribution_01.pdf

    P.S. The same concept can be used to bring RESET# to each device.

    P.S.2 Though not shown at the schematic, latches are of the LVC1G373 kind (they don't have a Q# output pin, and their individual CE#s (not shown) are meant to be connected to GND); OR-gates are LVC1G32.

    TI seems to have the faster ones, both devices, but I must confess I didn't search too much, so other brands can have better specs...

    P.S.3 A 10k resistor could be used, to ensure Master_CE# = High during and after circuit power up, to keep every HR disabled, until valid control lines can be properly set, by the driving P2.
  • Thanks Henrique. Now what you are talking about becomes very apparent to me. It could probably be added to the start of the CS sequence. Existing code shown below, address setup bytes are still byte banged right now, but that might be optimized out later. I guess we would just enable the data bus output earlier then send that byte with the multiple chip select state before dropping the master CS low, so it may only add one more instruction to the setup sequence which isn't too bad.
                                fltl    rwdspin                 'prepare pins for next memory access
                                drvl    clkpin
                                drvl    cspin
    
                                wrpin   #0, clkpin              'run clock in GPIO mode
    
    p5                          setbyte dira+PINX, #$FF, #BYTEX 'setup data bus as output
    
                                getbyte pb, addrhi, #3
    p6                          setbyte outa+PINX, pb, #BYTEX
                                drvnot  clkpin
    
  • Yanomani Posts: 958
    edited 2020-02-09 - 02:36:44
    ... and to its end, after each transaction has ended (CA phase + Latency + Data Block), to ensure de-selection of currently used device (power consumption concerns).

    The simple fact that you need to raise Master_CS#, after CK is brought LOW, reading/writing the last byte of the last word, is enough, to ensure each and every HR_CS# will be brought HIGH.
  • Did I understand it right (your driver), thinking you are using fixed latency counts, to keep deterministic timing between the CA phase and data-access phase?
  • Yes, fixed latency is used to eliminate extra refresh latency based on RWDS state. IIRC ISSI HyperRAM on the Parallax board already uses this by default (in fact it can't be disabled). Variable latency is not supported at this time. The Flash device doesn't use it either. The only issue with flash is that in some burst transfer cases depending on the starting address extra dummy data can be inserted at the crossing of pages which the streamer cannot accommodate. This will either have to be known by the application so it doesn't choose starting addresses that cause this, OR, potentially the burst transfers that cross page boundaries could be broken up into two requests by the driver to prevent this which I suspect is the preferred solution.
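
    The second option (letting the driver break up a burst at page crossings) could look something like the sketch below. It is only a model of the splitting arithmetic. The page size is a parameter because I have not stated one here, and the 512 byte value in the example is purely illustrative.

```python
def split_at_page(addr, length, page_size):
    """Split one burst into (address, length) pieces that never cross a page boundary."""
    pieces = []
    while length > 0:
        room = page_size - (addr % page_size)   # bytes left in the current page
        n = min(length, room)
        pieces.append((addr, n))
        addr += n
        length -= n
    return pieces


# A 32 byte burst starting 16 bytes before an (illustrative) 512 byte page boundary
print(split_at_page(0x1F0, 0x20, 0x200))
```

    A burst that stays within one page comes back as a single piece, so the common case costs only the boundary check.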
  • Fixed latency (2x Latency Count) always adds a second latency period (hence the 2x), to ensure deterministic timing between CA phase and the data-access phase, at the expense of longer memory cycles.

    It can be totally avoided, if you intend to do so, by properly programming the corresponding control register, at the HR device.

    As for the controlling software, it's enough to sample the state of RWDS, within the CA phase (say, between the first and second HR_CK period, before outputting CA{23:16}).

    Then, you gain some HR_CK periods worth of time, to decide if a second latency count will be needed, before entering the data-access phase.

    If it can be implemented that way, you'll be rewarded with shorter access times; in fact, not much each turn, but, like savings, a dime into the pot, every time.
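
    As a sketch of that decision (a model of the rule described above, not driver code): with fixed latency the second period is always paid, while with variable latency it is paid only when RWDS was sampled High during the CA phase. The function and parameter names are made up.

```python
def initial_latency_clocks(latency_count, fixed_latency, rwds_high_during_ca):
    """Clocks to wait between the CA phase and the data phase."""
    if fixed_latency or rwds_high_during_ca:
        return 2 * latency_count     # second latency period required
    return latency_count             # no extra period needed this time


print(initial_latency_clocks(6, fixed_latency=True, rwds_high_during_ca=True))
print(initial_latency_clocks(6, fixed_latency=False, rwds_high_during_ca=False))
```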
  • Yes I think once HyperRAM devices that use variable latency are found and utilised it could be revisited with optimisations. It is obviously more difficult to design a driver to setup every type of device if they each have different custom register sequences to setup the latency etc. Right now I am just starting out with their default reset settings (which I know is not ideal). In time it can be optimised further. Ideally the caller that spawns the HyperRAM COG can also run setup code that configures the device after reset (ie. this gets done at a high level via standard memory/register transfers, and not hard coded into the driver COG directly).
  • Yes, this is a better approach, since command register programming can be done well before the first data-access. But the driver Cog routine needs to be aware of this situation, and, in fact, RWDS will only transition to a LOW state during the CA phase, if variable latency has been programmed, thus, if the routine is able to react to this event properly, the same driver software can be used in both situations.
  • jmg Posts: 14,278
    edited 2020-02-10 - 20:53:05
    rogloh wrote: »
    Yes I think once HyperRAM devices that use variable latency are found and utilised it could be revisited with optimisations. ..

    Looking at the ISSI HyperRAM and OctalRAM, they spec
    tDQSV CS# Active to DQSM valid 12 ns (max)
    and that seems to give an early indication of if a refresh-adder is needed or not.

    The OctalRAM has a latency/MHz table, showing 5 is needed for 133MHz/3v, and 3 is ok for up to 83MHz
    Octal RAM CR[7:4]
    0000  3 clocks                    6 clocks   83MHz
    0001  4 clocks                    8 clocks   100MHz
    0010  5 clocks (default at 3V)    10 clocks  133MHz
    0011  6 clocks                    12 clocks  150MHz
    0100  7 clocks                    14 clocks  NA
    0101  8 clocks (default at 1.8V)  16 clocks  166/200MHz(2)
    0110 - 1111 Reserved - NA
    
    HyperRAM  CR7-4 Initial Latency
    0000 - 5 Clock Latency
    0001 - 6 Clock Latency (default)
    0010 - Reserved
    0011 - Reserved
    0100 - Reserved
    ...
    1101 - Reserved
    1110 - 3 Clock Latency
    1111 - 4 Clock Latency
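
    For reference, the HyperRAM CR7-4 table above as a small lookup (a sketch only). The register value passed in is whatever configuration word the caller has read back from the device; nothing here talks to hardware, and reserved codes simply return None.

```python
# CR[7:4] code -> initial latency in clocks, copied from the HyperRAM table above
HYPERRAM_LATENCY = {0b0000: 5, 0b0001: 6, 0b1110: 3, 0b1111: 4}


def hyperram_initial_latency(config_reg):
    """Return the initial latency in clocks, or None for a reserved code."""
    code = (config_reg >> 4) & 0xF
    return HYPERRAM_LATENCY.get(code)


print(hyperram_initial_latency(0b0001 << 4))
```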
    

    Looks like ISSI IS66WVH8M8BLL-100B1LI HyperRAM is stocked.

    One quick test would be to enable the faster access, and do a simple test and increment, and visual check, to see how often refresh collisions actually do occur.
    Possible effects ?
    If CS# spends a bit longer hi, does that reduce the collision hits ?
    If CS# pulses, does that do one-refresh per CS edge ? or is n-CLKs needed too ?

  • Yes jmg we can and should check for RWDS sometime early after CS falls. I have been figuring out the code to try to do that. It may make sense to add this dynamic check once the byte banging address setup code is reworked to use the streamer. Nothing in the original code I used had that feature and the latency was originally static.

    I am putting in the following features...
    - notification after service completion via optional COGATN as well as the normal mailbox clearing update
    - a register control path allowing reconfigurable latency per bank, which is required when different device types are used, such as independent HyperFlash + HyperRAM banks or devices from different manufacturers etc.

    I still need to add a few more things for configuration before it is ready for testing, and this configuration is turning out to be much more complex than the actual memory transfers. Due to supporting more than one bank now, much of the original code has been reworked and heavily restructured and further optimised and now needs extensive retesting which I'm not looking forward to.

    After this major restructure I have used up 45% of LUT and 95% of COGRAM. I don't expect it to grow massively larger now and there should be space left for what I hope to add now I am using LUTRAM for holding code as well. I expect to save some more COG RAM too with some more optimisations and could balance things further over both RAMs if needed.

    I would quite like to add list of requests for audio channel streaming COGs, and am now thinking about other extensions for byteFill, wordFill, longFill type operations which could be very handy and avoid the need for client COG involvement until the operation completes. A similar external memory block copy might be possible one day too, but it still needs hub transfers and a nominated intermediate buffer because of the streamer. I'm certainly not going to get that far right away but I don't want to preclude it in the design either.

  • Lists of requests would be quite useful when using the video driver, too. Imagine blitting a dirty rectangle into the external framebuffer. That'd need a separate transfer per line (unless FB width == DR width, of course)
  • Yes that would be useful too Wuerfel_21. I'd also thought about the video graphics applications with this request list feature. If you could combine it with fill operations you could potentially provide a list of start address, length values and quickly fill in rectangles or ellipses or triangles (3d?) quickly as well. This then starts to accelerate some simple graphics functions. It's not quite so good for single pixel update per scanline operations like drawing non-horizontal lines unfortunately due to their non-sequential access to the frame buffer, but for pure overwrite copy/fill stuff that doesn't need read-modify-write operations involving the COG on each scan line the list thing could work well with some graphics primitives and free the COG to do other useful work in parallel.

  • rogloh Posts: 1,999
    edited 2020-02-13 - 08:55:06
    I've coded up the list feature in this driver. It seemed to only add about 40 longs which I am happy with. Still to test it out.

    One thing I've noticed is that supporting any future fill and copy capabilities somewhat comes naturally with this list concept. To do a fill you can populate two back-to-back commands in a request command list for the COG and then pass it into the mailbox with a special start of list command that points to this list/array in hub memory. The minimum structure for this would take up 5 longs, two 64 bit mailbox req+data combinations plus probably a zero long to terminate the array sequence.

    So the two different requests in the array/list for a fill operation would be:
    1) issue a special set fill data command which captures a byte/word/long fill argument and saves this state
    2) issue a write of the desired element size including the external memory start address in the mailbox request and total transfer count in the data parameter.

    Because this fill state is captured (per COG) any subsequent fill operation(s) can then be done without needing additional mailbox parameters to include the fill data every time, or needing to include a larger length parameter which would increase the mailbox size and potentially slow down polling.
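
    To make the 5 long structure concrete, here is a sketch of how a client might lay out such a fill list in hub memory. The command codes and the bit layout of the request long are placeholders I made up for this example (the real encodings are not settled); only the overall shape, two request+data pairs followed by a zero terminator, comes from the description above.

```python
import struct

# Hypothetical command codes and field layout, for illustration only
CMD_SET_FILL = 0x1
CMD_WRITE_LONGS = 0x2


def make_request(cmd, bank, offset):
    # Assumed layout: command in the top nibble, 4-bit bank, 24-bit offset
    return ((cmd & 0xF) << 28) | ((bank & 0xF) << 24) | (offset & 0xFFFFFF)


def build_fill_list(bank, offset, pattern, count):
    """Pack the two request+data pairs plus a zero terminator as 5 little-endian longs."""
    longs = [
        make_request(CMD_SET_FILL, bank, 0), pattern & 0xFFFFFFFF,   # 1) capture fill data
        make_request(CMD_WRITE_LONGS, bank, offset), count,          # 2) write 'count' elements
        0,                                                           # end of list
    ]
    return struct.pack("<5I", *longs)


fill = build_fill_list(bank=1, offset=0x1000, pattern=0xFF00FF00, count=256)
print(len(fill), "bytes")
```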

    This is already good for a single fill operation that can fill up to the entire memory size in one go. To do a sequence of fills of the same type without needing the same fill pattern command each time in the list (e.g., to fill a rectangle on the screen in the same colour) I think I might also look at remaining in fill mode until it is changed or the list completes. So it is more of a fill mode on/off toggle control command within the list.

    Also a block copy operation between two external memory addresses can be considered a sequential read burst and then a write burst in a list and could potentially work in a similar way with pairs of commands combined to do the copy operations. An intermediate hub buffer will of course be needed for this (per COG if multiple COGs are doing their own copy operations) because the streamer is involved.

    I think it should be possible to copy from bank to bank in this way too, e.g. from HyperFlash to HyperRAM. For non-video applications the copy speed could get fairly fast (up to 4us burst transfers are possible so perhaps 75% the bus bandwidth if the setup overhead is ~1us). With video competing it won't be so fast of course.

    I was wondering about the request list just being a packed array (2 longs per element) or a linked list (3 longs per element including a next element pointer). To save on hub memory space I've started with an array which is what is coded right now. It might also one day be possible to support both modes of operation when a request sequence is started. A linked list allows complex request sequences to be patched together or re-arranged dynamically and could avoid some extra copying at the expense of more hub memory usage. Anyway, something to worry about later...

  • If the P2 had an atomic compare-and-swap instruction, it would be possible for all cogs to send commands to the driver through the same linked list without any locks.
  • That's a huge "if". The eggbeater would need special circuits to perform it, and it would also always stall all other hubRAM accesses.
  • I'm sorry, I asked about this before and forgot about Chip's response. I thought there was already a read-modify-write going on for word and byte accesses, but he clarified that there isn't.
  • Wuerfel_21 Posts: 601
    edited 2020-02-13 - 19:18:28
    rogloh wrote: »
    quickly fill in rectangles or ellipses or triangles (3d?) quickly as well.
    Yeah, many filled shapes can be implemented in terms of horizontal lines. Even arbitrary convex polygons are easy to handle.

    Speaking of which, I've recently read about how Quake's software renderer works. Interestingly enough, the first pass responsible for drawing of static geometry and certain kinds of object (doors and platforms, mostly) writes to the framebuffer in a perfectly linear fashion. It writes each pixel exactly once, left-to-right, top-to-bottom.
    (Everything else (enemies, particles, etc.) is then drawn on top (clipped by the 1/z buffer the static pass generated) to avoid either having to splice it all into the BSP tree or having to actually Z-sort polygons.)

    I wonder if the technique is viable on P2... (with somewhat simpler geometry and no 1/Z buffering (and thus necessarily much simpler moving entity geometry)). In theory 4 P2 cogs at 250+ MHz should(tm) have integer power comparable to a Pentium's combined integer+float power, the performance is there, but memory is tight.
  • rogloh Posts: 1,999
    edited 2020-02-13 - 23:18:17
    Wuerfel_21 wrote: »
    I wonder if the technique is viable on P2... (with somewhat simpler geometry and no 1/Z buffering (and thus necessarily much simpler moving entity geometry)). In theory 4 P2 cogs at 250+ MHz should(tm) have integer power comparable to a Pentium's combined integer+float power, the performance is there, but memory is tight.

    Time will tell what will be doable on a P2. It's going to be cool to see what can be achieved with the constraints imposed : i.e. no FPU and only 512kB of hub memory; and how much of the workload can benefit using parallel COGs.

    Also having multiple HyperRAM devices on different pins would allow exclusive access without video taking any bandwidth on some of those banks and that would help here too. In a gaming application for example, 4 totally independent banks of HyperRAM could consume up to 48 pins (or 44/45 with shared resets) and that would still leave enough IO for video + audio + USB HID control.
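
    The pin arithmetic behind those numbers, as a quick check. It assumes each fully independent bank uses 8 data pins plus CK, CS# and RWDS, with RESET as the only optional or shareable signal; reading "44/45" as no reset pin versus one shared reset pin is my interpretation.

```python
DATA_PINS = 8
CONTROL_PINS = 3      # CK, CS#, RWDS


def pins_needed(banks, reset="per_bank"):
    """Total P2 pins for a number of fully independent HyperRAM banks."""
    per_bank = DATA_PINS + CONTROL_PINS
    reset_pins = {"per_bank": banks, "shared": 1, "none": 0}[reset]
    return banks * per_bank + reset_pins


for mode in ("per_bank", "shared", "none"):
    print(mode, pins_needed(4, mode))
```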

    Update: Of course in the extreme case above with 4 independent HyperRAM banks you probably wouldn't want to burn 4 extra driver COGs as well so some of those memories would probably need to be coupled directly to certain processing COGs for more dedicated access purposes in the overall pipeline. So maybe it could still be of benefit in some cases depending on how the code can be split up.
  • rogloh Posts: 1,999
    edited 2020-02-22 - 06:05:24
    I have been playing with more of this driver over the last week or so in my time available. It's slowly starting to get there but it is complex and I'd say the actual HyperRAM raw transfer stuff is probably one of the simpler aspects of all of this.

    The new basic outer processing structure is in place and adaptive COG mailbox polling is working for minimising polling latency.

    The faster back-to-back clocks individual byte/word/long writes appear to now be behaving correctly on the scope (no byte banging in the address phase anymore), with adaptive latency based on RWDS and I am currently extending this to block transfer writes and fills with variable size as well as fixed block lengths. Some related stuff I added to control RWDS was discussed in this recent thread https://forums.parallax.com/discussion/comment/1490139/#Comment_1490139

    My old slower but functioning HyperRAM routines are still present and I will probably use the old read code to validate the new write code works assuming I can still fit it in the COG along with the new variants.

    I still have to do some more work for the zero latency writes, faster reads and external mem to external mem copy and lists with this new driver structure and there is still a little bit more configuration work too. But a lot of that is duplicating what has already been done with writes with some adjustment. HyperFlash probably needs some zero latency bursts as well for decent write performance. Otherwise just doing single word flash writes is going to be really s..l..o..w. Eg. something <50 kB/s or so if video is active. It would be nice to boost to around 2-4MB/s or so using 256 byte bursts.

    There's actually quite a lot to all this now. Because it supports multiple device banks using different control pins and breaking up bursts to account for maximum chip select time as well as video COG priority and programmable latencies, etc, there is a fair bit to try to manage. If it was just a single memory device with static control pins and latency things would be far simpler, though not quite as flexible. Once this driver is done I guess I could create a subset version with only a single bank/device supported. It would provide slightly less latency for small accesses though the longer bursts will not gain as much because the setup overhead diminishes relative to the overall service time as the transfer size increases.

    If it fits, later I'd also like to add some optional SPI flash read routines so we could map one or more 16MB bank(s) to the SPI flash and this could allow background memory to memory copies from SPI flash directly into HyperRAM. The same thing would be possible to do with HyperFlash. SPI flash access, being slower in general and having Smartpin support for serial shifting etc could be a good candidate to run in HUB exec mode if it didn't fit into this driver.

    To get a feeling of how things are starting to get more complex with the overall management of this shared Hyper memory driver, the code pretty much needs to deal with all the following aspects and there is probably even more to come as new issues are discovered or if further extensions get added.

    1) Shared Hyper bus:
    Multiple devices (up to 15) are possible on the same data bus and a minimum set of per device parameters to be resolved dynamically per access are:
    - reset pin (optional, currently only used once at startup)
    - clock pin
    - chip select pin
    - rwds pin
    - programmable latency - programmed in device HW register and also configured in driver
    - device size if > 16MB

    2) Up to 15 different 16MB address banks:
    - each bank is mapped to a device (many to one mapping is required if any device size > 16MB). It's probably also desirable for the address range to wrap around within the same bank when burst transfers exceed the address range limit of the bank. TBD.
    - 128 request/bank EXECF configurable vectors allow general memory read & write accesses for different devices and a special configuration pathway for any global or device specific register configuration such as reconfiguring the latency, starting fill lists, HyperRAM zero latency register access etc.

    3) Mailbox polling:
    - Up to 7 COGs to poll (the memory driver COG is automatically excluded)
    - priority vs RR (round-robin) polled COGs
    - configurable polling priority order
    - any nominated excluded COGs are not to be polled
    - per COG notification preference (COGATN and/or mailbox only)
    - a per COG burst size setting (this is static+common right now, but could be re-configured or dynamic one day perhaps for enforcing bandwidth fairness across RR COGs with a token/leaky bucket type of algorithm)

    4) Operational transfer rates
    - sysclk/1 or sysclk/2 read data transfer rates
    - the maximum rate is normally auto-detected and the input delay computed via a lookup table, but this also needs to be overridable to force slower sysclk/2 read operation if a HW system or its normal operating conditions do not support the more aggressive read timing.
    - different devices might allow sysclk/1 vs sysclk/2 read operation. Eg, HyperFlash clocks up to 166MHz while the HyperRAM only officially clocks up to 100MHz on the Parallax board though it can also be overclocked slightly, so that could affect the auto detection too.
    - writes currently transfer data only at sysclk/2 rates so the clock transition can be centered in the stable middle of the data bit changes making the signal timing far more reliable.

    5) Re-configuration pathway
    - there is an optional COGATN notification to the memory driver COG once up and running, to trigger regeneration of polling loop code allowing COG clients to potentially come and go dynamically when the driver is informed of the change.

    6) Per COG service state, eg. track request lists/fills/copies/bursts in progress.
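    To make the priority vs round-robin polling in item 3 a bit more concrete, here is a small Python model of one way such a selection could behave. It is only a sketch under my own assumptions: the real driver generates a polling loop in COG code, and the class and method names here are hypothetical.

```python
class MailboxPoller:
    """Toy model: priority COGs are always checked first, in a fixed order;
    the remaining COGs take turns in round-robin fashion."""

    def __init__(self, priority_cogs, rr_cogs):
        self.priority_cogs = list(priority_cogs)  # fixed polling order
        self.rr_cogs = list(rr_cogs)              # rotates after each service

    def next_to_service(self, pending):
        """Return the COG id to service next, or None if nothing is pending.

        pending is the set of COG ids whose mailbox holds a request.
        """
        for cog in self.priority_cogs:
            if cog in pending:
                return cog
        for index, cog in enumerate(self.rr_cogs):
            if cog in pending:
                # Move the serviced COG to the back so the others go first next time.
                self.rr_cogs.append(self.rr_cogs.pop(index))
                return cog
        return None


if __name__ == "__main__":
    poller = MailboxPoller(priority_cogs=[1], rr_cogs=[2, 3, 4])
    print(poller.next_to_service({2, 3}))  # an RR COG is chosen
    print(poller.next_to_service({1, 2}))  # the priority COG wins
```

    A per COG burst size limit would sit on top of this, so that a long transfer from one round-robin COG is cut into pieces and the other COGs get polled in between.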
  • Sounds great, Rogloh.

    I would think that putting 16 HyperRAM chips on a board may not be that likely, as they are rather expensive, aren't they? I think most people would just use one chip. If there was a second chip, you could use another cog to drive it and thereby double the throughput. Or, is there any advantage to having multiple chips on the same bus? Does it make inter-chip transfers possible?
  • I would tend to agree about 15-16 devices not being practical/realistic. However the Parallax board already has 2 chips fitted on the one bus, and once you mix two devices then supporting more than two is not a great deal harder in software, it's mainly extra space. A full nibble is allocated to the device/bank selection which can be apportioned between either device address or bank, so you could have two banks of 128MB each or 15 banks of 16MB devices for example. It's flexible.
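    A minimal Python sketch of the nibble-based bank selection, assuming the bank nibble sits directly above a 24-bit (16MB) offset. The bit positions, the bank table and the device names are illustrative assumptions, not the driver's actual layout.

```python
BANK_SHIFT = 24                        # 16MB per bank (assumed position)
BANK_MASK = 0xF                        # one nibble selects the bank
OFFSET_MASK = (1 << BANK_SHIFT) - 1

# Hypothetical mapping: bank -> (device name, first bank of that device).
# A 32MB device occupies two consecutive banks (many to one mapping).
BANK_MAP = {
    2: ("ram0", 2),
    3: ("ram0", 2),
    4: ("flash0", 4),
}


def decode(addr, bank_map=BANK_MAP):
    """Split an external address into (device, offset within that device)."""
    bank = (addr >> BANK_SHIFT) & BANK_MASK
    offset = addr & OFFSET_MASK
    if bank not in bank_map:
        raise LookupError(f"bank {bank} is not mapped to any device")
    device, first_bank = bank_map[bank]
    # Banks after the first one extend the device address above 16MB.
    return device, ((bank - first_bank) << BANK_SHIFT) | offset


if __name__ == "__main__":
    device, dev_offset = decode((3 << BANK_SHIFT) | 0x10)
    print(device, hex(dev_offset))
```

    With a table like this, the same nibble can describe either many small devices or a few large ones, which is the flexibility described above.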

    If someone wanted to split the busses for the different devices they can then do multiple transfers simultaneously to increase overall bandwidth albeit with more memory COG drivers.

    I was wondering about direct inter chip transfers too yesterday, thinking you might be able to drive a different address out to each device if you could suspend clocks at the right time in the latency period. Maybe something is possible where one chip is reading and the other writing sharing a common clock phase over both clock outputs at the same time using a common data bus. It could be tricky to get it to work though not impossible with suitable clock control. It could double the transfer rate and completely avoid using the hub memory as the intermediate buffer.
  • I've wondered about doing that same thing, albeit on a much smaller scale... an SRAM with one of those SSD1331 (or other) displays. Haven't tried working on it, though - this would be neat to see.
  • jmg Posts: 14,278
    rogloh wrote: »
    I would tend to agree about 15-16 devices not being practical/realistic. However the Parallax board already has 2 chips fitted on the one bus, and once you mix two devices then supporting more than two is not a great deal harder in software, it's mainly extra space. A full nibble is allocated to the device/bank selection which can be apportioned between either device address or bank, so you could have two banks of 128MB each or 15 banks of 16MB devices for example. It's flexible.
    Someone may want to combine RAM and FLASH, and you can already buy RAM and FLASH dual-die parts.
    These have a common clock and 2 CS:
    CS1# Input Chip Select 1: Chip Select for the HyperFlash memory.
    CS2# Input Chip Select 2: Chip Select for the HyperRAM memory.
    rogloh wrote: »
    If someone wanted to split the busses for the different devices they can then do multiple transfers simultaneously to increase overall bandwidth albeit with more memory COG drivers.

    I was wondering about direct inter chip transfers too yesterday, thinking you might be able to drive a different address out to each device if you could suspend clocks at the right time in the latency period. Maybe something is possible where one chip is reading and the other writing sharing a common clock phase over both clock outputs at the same time using a common data bus. It could be tricky to get it to work though not impossible with suitable clock control. It could double the transfer rate and completely avoid using the hub memory as the intermediate buffer.

    Hehe, yes, with separate clocks that could be possible, on paper.