Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)
rogloh
Beta Release 0.8b of my HyperRAM/HyperFlash external memory driver is now available.
UPDATE 22 DEC 2021
A new Beta release 0.9b of my memory driver is now available, including support for PSRAM and SRAM. This works with the PSRAM on the next-gen P2-EC32MB EDGE board.
You may need to tweak the timings for these lines:
psram.spin2 :
delayTable long 7,92_000000,150_000000,206_000000,258_000000,310_000000,333_000000,0
memory.spin2 :
PSRamDelays long 7,92_000000,150_000000,206_000000,258_000000,310_000000,333_000000,0
I didn't update them specifically for the P2 Edge yet, as I still had them configured for my own PSRAM board, but I will do so when I get some time soon.
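As a rough illustration of how such a frequency-keyed delay table might be consumed (the interpretation that each threshold bumps the read input delay by one step, and that the leading 7 and trailing 0 are table metadata, is an assumption here, not taken from the driver source):

```python
# Thresholds taken from the psram.spin2 delayTable above; the semantics of the
# leading 7 and trailing 0 entries are assumed to be table metadata/terminator.
DELAY_THRESHOLDS_HZ = [92_000000, 150_000000, 206_000000,
                       258_000000, 310_000000, 333_000000]

def input_delay(sysclk_hz):
    """Input delay grows by one step per frequency threshold crossed."""
    return sum(1 for t in DELAY_THRESHOLDS_HZ if sysclk_hz >= t)
```

Tweaking the timings for a particular board then amounts to shifting the threshold frequencies up or down until reads are reliable across the clock range you care about.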
Comments
https://forums.parallax.com/discussion/170645/proposed-external-hyperram-memory-interface-suitable-for-video-drivers-with-other-cogs
Originally I started with some HyperRAM access code from ozpropdev and extended it, then wrapped around it the rest of the infrastructure required to implement the features listed below. It's been working for a while with my video driver and other COGs sharing the HyperRAM, but in an unrefined state, not really in a form suitable for release. At this point I've only worked with the HyperRAM module on a P2-EVAL, so the signal timing of other systems may vary and require further changes, or potentially not work at all at certain speeds/temperatures etc. I've not tested the HyperFlash device yet (it is very complex to write to compared to RAM), but this driver should in theory be able to access it as well.
In any case here is what it should do when done...any further suggestions now are still welcome as the code is still in late development.
Some possible future areas I might investigate or possibly add for a later release:
What I would like to use the Hyper-Ram for (besides video) would be the ability to bank-switch some HUB area out and another one in. Maybe even more than one area.
I do have a P2-Eval A and one B, have two Hyper Ram modules and have finished my taxes, so can start to download your Video code and will try to make sense out of that first.
Then I might understand better what you are exactly talking about, but video RAM out of the HUB RAM into the Hyper Ram is a wonderful start. I think 512KB is still too small, so something like EMS, or bank-switching a la CP/M, or even misusing the overlay mode of GCC (somebody tried it with PropGCC and it worked).
Sadly you said that non-sequential access should be avoided, else I would even try to figure out some LMM running out of a LARGE address space; Catalina does it for P1 and EEPROMs. Maybe @RossH chimes in again, but HyperRAM support in Catalina for P2 might not be so complicated.
How about full blown paging? ...any further suggestions now are still welcome
will play tomorrow, thank you for doing this,
Mike
Non-sequential access is still possible with HyperRAM, it is just that the performance can suffer in the more demanding situations. Without video competing for bandwidth, at best you can probably only do in the order of 1 million individual accesses per second if you run the P2 around 250MHz or so compared to probably 20x this number for HUB RAM at the same clock speed with individual random access, however once you start transferring data it can be fast, up to 200MB/s at full rated speed on the P2-EVAL's HyperRAM board. Newer memory devices may go even higher with the P2.
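To put the random-vs-burst gap in perspective, a back-of-envelope throughput calculation using the figures quoted above looks like this:

```python
def effective_mb_per_s(accesses_per_s, bytes_per_access):
    """Effective bandwidth when every access is an individual random one."""
    return accesses_per_s * bytes_per_access / 1e6

# ~1 million individual long (4-byte) random accesses per second at ~250MHz
random_bw = effective_mb_per_s(1_000_000, 4)   # MB/s for random long accesses
burst_bw = 200.0                               # MB/s quoted sequential burst rate
```

So random long accesses deliver only a few MB/s while long sequential bursts approach the full rated bus bandwidth, which is why latency-tolerant, burst-oriented access patterns matter so much here.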
Because of the latency, some caching would make sense if you wanted to execute code brought in from HyperRAM. It might be possible to add some management of caching somewhere: partly in this COG, or contained within some other caching COG acting on behalf of its requestors, or, when only a single COG requires it, entirely handled within the VM of the COG executing from this memory.
I don't think we'll ever get to full paging...we'd need a mechanism to know when to pull new pages into memory and swap out the old ones, and we have nothing like that built into hardware. So it's probably more of an overlay model or something like the XMM approach for P1 if we can put up with the (variable) latency - some slower or legacy P1 applications might be able to use it if they already worked fine that way for the P1.
* supports up to 15 (14?) devices on the same bus
The reason for the "(14?)" above is that I am thinking I'd like to reserve bank "0" as a way to differentiate hub addresses from external addresses. Internal HUB RAM addresses range from 0 to 2^19-1 with the upper bits zero. This would then mean the bank is zero, and we could make use of this to indicate the difference between internal and external memory. I have 4 bits to indicate the bank. Right now the all-1's (%1111) bank address is also potentially reserved for configuration, though I could possibly share the %0000 value for this too. Another way to go is simply using the top bit 31, which becomes part of the mailbox service request and would be overwritten anyway. So this is still up in the air.
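A sketch of the bank-0 scheme described above (the 16MB bank size implies the bank field sits in address bits 24..27; this exact layout is inferred for illustration, not confirmed from the driver):

```python
BANK_SHIFT = 24                      # 16MB banks => bank field in bits 24..27
BANK_MASK = 0xF

def decode(addr32):
    """Split a 32-bit request address per the bank-0 idea: bank 0 means the
    address is an internal HUB RAM address (upper bits zero), anything else
    selects an external memory bank plus a 24-bit offset within it."""
    bank = (addr32 >> BANK_SHIFT) & BANK_MASK
    if bank == 0:
        return ('hub', addr32)
    return ('external', bank, addr32 & ((1 << BANK_SHIFT) - 1))
```

A single 32-bit mailbox address can then transparently refer to either memory space, at the cost of reserving one bank value.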
Of course for the video applications controlling SPI flash from a different dedicated driver is better as it allows more bandwidth and video won't compete with it. We can always have two instances of this driver in such a case, one with a single flash device and the other for HyperRAM devices (at the expense of an extra driver COG).
Update: The good thing is that the way I have implemented the service request+bank lookup and jump code is that in theory it could resolve to Hub Exec code addresses too, so in future additional or more complex external memory types that don't fit inside the COG could still be served by Hub Exec driver code extensions to this driver. These could read from additional mailbox data elements too if needed. This may allow other "plug-in" memory types, or other things that could vary dynamically like mounting filesystems etc. With 7 bits of service table lookup we have effectively 128 different independent memory access "services" available all working on 16MB address ranges by default, or less services with larger memory banks/more mailbox data slots, and many to one "wildcard bit" service mapping is supported to widen the address range. It's a fairly flexible architecture right now. The main thing to consider is total execution time to service a request needs to be kept short (eg. < 1us) or can be broken up into smaller pieces so as to not affect video. SD card accesses would not suit this model for example, but flash memory could.
A simple way to have many 3.3 V HyperBus memory devices sharing the same 11 or 12 interface signals (CK / CS# (selectively gated) / DQ{7:0} / RESET# (selectively gated, if ever needed) / RWDS), consuming no other P2 pins, can be made out of dependable and fast little-logic OR-gates and flip-flops/latches.
This approach leverages the fact that each HyperBus device floats (HiZ) its DQ{7:0} and RWDS shortly after CS# goes High (respecting both (tCSH + tDSZ) and (tCSH + tOZ), which gives a total of 7 ns after the last CK = Low at the end of the current/last transaction), making those lines available to be driven by the HyperBus controller (P2) and passed (e.g., through transparent latches), needing no other control signal than the master (P2) CS#-driving pin going High.
The logic sequence would be (3.3 V devices):
- ensure master CK (P2 pin) is driven Low (at the completion of the current HyperBus transaction);
- drive master CS# (P2 pin) High;
- ensure at least 7 ns (more is better, to account for capacitance/propagation delays); e.g., during this time, an SN74LVC1G32 2-input OR-gate would fully switch its output from Low to High (< 4 ns), deselecting the currently-selected device;
- drive the selection bits at DQ{7:0} (High = Deselected, Low = Selected);
- ensure at least 8 ns for the selected (Low) signal(s) to propagate through the transparent latch(es) (4 ns, e.g., SN74LVC1G373) and complete the internal control path of each OR-gate (< 4 ns);
- if a new transaction is to begin, drive the master (P2 pin) CS# Low, definitively latching the selection, so you can use DQ{7:0} to prepare the next CA phase for the next command to be sent to the HyperThings.
- if the ability to control each HR RESET# is desirable, repeat the circuit, now using the master RESET# (P2 pin) as the controlling signal.
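The gating described above (each device's CS# is the OR of the master CS# and a latched DQ bit, with the latch transparent while master CS# is High) can be modelled in a few lines; the class and signal names here are illustrative only:

```python
class CsDemux:
    """Model of one OR-gate + transparent latch per device:
    HR_CS#[i] = Master_CS# | Q[i].
    The latch follows DQ{7:0} while Master_CS# is High and holds (latches)
    the selection once Master_CS# goes Low."""

    def __init__(self, n_devices=8):
        self.q = [1] * n_devices          # 1 = deselected after power-up

    def step(self, master_cs, dq):
        if master_cs == 1:                # latch transparent: track DQ bits
            self.q = list(dq)
        # OR-gate per device: master CS# High forces every HR_CS# High
        return [master_cs | b for b in self.q]
```

While Master_CS# is High every device sees CS# High regardless of DQ, and once Master_CS# drops Low only the device(s) whose latched DQ bit is 0 see an active (Low) chip select.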
The size of those little-logic devices is so small (down to 1.5 mm × 1 mm in some cases) that one can confuse them with decoupling caps. They are also cheap enough that they don't present a cost problem either.
For write-intensive tasks, the ability to select multiple devices at the same time can be used to advantage, e.g., to clear/pre-load many buffers in parallel at no extra cost from a P2 pin-usage point of view. You would need a unique CK/RWDS pair per independent DQ{7:0} bus if you intend to broadcast to (or read back from) many other P2s.
Using two or more streamers could enable you to broadcast any portion(s) of a shared Hub image to any number of receiving P2s (limited by how many DQ{7:0} buses you are keen to reserve at the "master" Hub-holding P2). And each new bus opens the possibility of attaching eight more HyperBus devices.
I understand that DQs, CK and RWDS would soon be compromised due to routing and capacitive effects, but many-layer PCBs and bottom/top-layer assembly of SMDs are not monsters; they only have to be carefully done in order to work properly.
Henrique
At this early stage, for RESET pins I just nominate a couple of parameters: the minimum low time and the minimum high time before any CS is possible on the device. At driver initialisation time I then strobe all resets low in parallel and wait (for the largest minimum low time), then strobe all high and wait (for the largest minimum high time). So with this basic scheme the device RESETs (if enabled) can already share the same pin if they wish, or have their own independent pins, though in either case they will all be pulsed the same at driver startup.
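That startup scheme reduces to taking the worst case of each timing parameter across all configured devices, roughly:

```python
def reset_schedule(devices):
    """devices: list of (min_reset_low, min_high_before_cs) per device, in
    whatever time unit the driver uses. Strobing every reset in parallel means
    each wait must cover the largest requirement among all devices."""
    low_wait = max(low for low, high in devices)
    high_wait = max(high for low, high in devices)
    return low_wait, high_wait
```

So a device with a long minimum reset-low time simply stretches the shared startup pulse for everyone, which is harmless since this only happens once at initialisation.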
P.S. The same concept can be used to bring RESET# to each device.
P.S.2 Though not shown in the schematic, the latches are of the LVC1G373 kind (they don't have a Q# output pin, and their individual CE#s (not shown) are meant to be connected to GND); the OR-gates are LVC1G32.
TI seems to have the fastest ones for both devices, but I must confess I didn't search much, so other brands may have better specs...
P.S.3 A 10k resistor could be used to ensure Master_CS# = High during and after circuit power-up, to keep every HR disabled until valid control lines can be properly set by the driving P2.
The simple fact that you need to raise Master_CS# after CK is brought Low, reading/writing the last byte of the last word, is enough to ensure each and every HR_CS# will be brought High.
It can be totally avoided, if you intend to do so, by properly programming the corresponding control register, at the HR device.
As for the controlling software, it's enough to sample the state of RWDS within the CA phase (say, between the first and second HR_CK periods, before outputting CA{23:16}).
You then gain some HR_CK periods' worth of time to decide if a second latency count will be needed before entering the data-access phase.
If it can be implemented that way, you'll be rewarded with shorter access times; not much each turn, but, like savings, a dime into the pot every time.
Looking at the ISSI HyperRAM and OctalRAM, they spec
tDQSV CS# Active to DQSM valid 12 ns (max)
and that seems to give an early indication of if a refresh-adder is needed or not.
The OctalRAM has a latency/MHz table, showing 5 is needed for 133MHz/3v, and 3 is ok for up to 83MHz
Looks like ISSI IS66WVH8M8BLL-100B1LI HyperRAM is stocked.
One quick test would be to enable the faster access, and do a simple test and increment, and visual check, to see how often refresh collisions actually do occur.
Possible effects ?
If CS# spends a bit longer hi, does that reduce the collision hits ?
If CS# pulses, does that do one-refresh per CS edge ? or is n-CLKs needed too ?
I am putting in the following features...
- notification after service completion via optional COGATN as well as the normal mailbox clearing update
- a register control path allowing reconfigurable latency per bank, which is required when different device types are used, such as independent HyperFlash + HyperRAM banks or devices from different manufacturers etc.
I still need to add a few more things for configuration before it is ready for testing, and this configuration is turning out to be much more complex than the actual memory transfers. Due to supporting more than one bank now, much of the original code has been reworked, heavily restructured and further optimised, and it now needs extensive retesting, which I'm not looking forward to.
After this major restructure I have used up 45% of LUT and 95% of COGRAM. I don't expect it to grow massively larger now and there should be space left for what I hope to add now I am using LUTRAM for holding code as well. I expect to save some more COG RAM too with some more optimisations and could balance things further over both RAMs if needed.
I would quite like to add list of requests for audio channel streaming COGs, and am now thinking about other extensions for byteFill, wordFill, longFill type operations which could be very handy and avoid the need for client COG involvement until the operation completes. A similar external memory block copy might be possible one day too, but it still needs hub transfers and a nominated intermediate buffer because of the streamer. I'm certainly not going to get that far right away but I don't want to preclude it in the design either.
One thing I've noticed is that supporting any future fill and copy capabilities somewhat comes naturally with this list concept. To do a fill you can populate two back-to-back commands in a request command list for the COG and then pass it into the mailbox with a special start of list command that points to this list/array in hub memory. The minimum structure for this would take up 5 longs, two 64 bit mailbox req+data combinations plus probably a zero long to terminate the array sequence.
So the two different requests in the array/list for a fill operation would be:
1) issue a special set fill data command which captures a byte/word/long fill argument and saves this state
2) issue a write of the desired element size including the external memory start address in the mailbox request and total transfer count in the data parameter.
Because this fill state is captured (per COG) any subsequent fill operation(s) can then be done without needing additional mailbox parameters to include the fill data every time, or needing to include a larger length parameter which would increase the mailbox size and potentially slow down polling.
This is already good for a single fill operation that can fill up to the entire memory size in one go. To do a sequence of fills of the same type without needing the same fill pattern command each time in the list (e.g., to fill a rectangle on the screen in the same colour) I think I might also look at remaining in fill mode until it is changed or the list completes. So it is more of a fill mode on/off toggle control command within the list.
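The two-command fill list described above might be laid out like this; the command codes used here are purely hypothetical placeholders, not the driver's actual opcodes:

```python
# Hypothetical command codes -- placeholders for illustration only.
CMD_SET_FILL = 0x0E      # capture the byte/word/long fill pattern, enter fill mode
CMD_WRITE_LONGS = 0x03   # burst write of longs to external memory

def build_fill_list(ext_addr, pattern, count):
    """Minimum 5-long structure: two 64-bit req+data pairs plus a zero
    terminator, matching the packed array layout described in the post."""
    return [
        CMD_SET_FILL << 24,                                # request: set fill data
        pattern,                                           # data: the fill value
        (CMD_WRITE_LONGS << 24) | (ext_addr & 0xFFFFFF),   # request: write burst
        count,                                             # data: transfer count
        0,                                                 # end-of-list terminator
    ]
```

The client COG would build this array in hub RAM once, then hand its address to the driver via a single start-of-list mailbox command; repeated fills of the same pattern only need the second pair again while fill mode persists.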
Also a block copy operation between two external memory addresses can be considered a sequential read burst and then a write burst in a list and could potentially work in a similar way with pairs of commands combined to do the copy operations. An intermediate hub buffer will of course be needed for this (per COG if multiple COGs are doing their own copy operations) because the streamer is involved.
I think it should be possible to copy from different bank to bank in this way too, e.g. from Hyperflash to HyperRAM for example. For non-video applications the copy speed could get fairly fast (up to 4us burst transfers are possible so perhaps 75% the bus bandwidth if the setup overhead is ~1us). With video competing it won't be so fast of course.
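The quoted copy throughput follows from simple arithmetic on burst length versus per-request setup overhead (the post's ~75% figure presumably allows for extra per-request costs beyond the bare 1us setup assumed here):

```python
def burst_efficiency(burst_us, setup_us):
    """Fraction of bus time spent actually transferring data per request."""
    return burst_us / (burst_us + setup_us)
```

With 4us bursts and ~1us of setup each, the bus spends roughly four-fifths of its time moving data, and the efficiency only improves as the allowed burst length grows relative to the fixed setup cost.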
I was wondering about the request list just being a packed array (2 longs per element) or a linked list (3 longs per element included a next element pointer). To save on hub memory space I've started with an array which is what is coded right now. It might also one day be possible to support both modes of operation when a request sequence is started. A linked list allows complex request sequences to be patched together or re-arranged dynamically and could avoid some extra copying at the expense of more hub memory usage. Anyway, something to worry about later...
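The trade-off between the two representations can be seen in how each would be walked; this is a conceptual sketch, not the driver's actual code:

```python
def walk_array(mem, base):
    """Packed array: req,data pairs stored back-to-back (2 longs per element);
    a zero request long terminates the sequence."""
    out, i = [], base
    while mem[i] != 0:
        out.append((mem[i], mem[i + 1]))
        i += 2
    return out

def walk_list(mem, base):
    """Linked list: req,data,next triples (3 longs per element); a zero next
    pointer terminates, but elements can live anywhere in hub memory."""
    out, p = [], base
    while p != 0:
        out.append((mem[p], mem[p + 1]))
        p = mem[p + 2]
    return out
```

The array is denser and trivially sequential; the linked list costs one extra long per element but lets request sequences be spliced or re-ordered in place without copying.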
Speaking of which, I've recently read about how Quake's software renderer works. Interestingly enough, the first pass responsible for drawing of static geometry and certain kinds of object (doors and platforms, mostly) writes to the framebuffer in a perfectly linear fashion. It writes each pixel exactly once, left-to-right, top-to-bottom.
(Everything else (enemies, particles, etc.) is then drawn on top (clipped by the 1/z buffer the static pass generated) to avoid either having to splice it all into the BSP tree or having to actually Z-sort polygons.)
I wonder if the technique is viable on P2... (with somewhat simpler geometry and no 1/Z buffering (and thus necessarily much simpler moving entity geometry)). In theory 4 P2 cogs at 250+ MHz should(tm) have integer power comparable to a Pentium's combined integer+float power, the performance is there, but memory is tight.
Time will tell what will be doable on a P2. It's going to be cool to see what can be achieved with the constraints imposed : i.e. no FPU and only 512kB of hub memory; and how much of the workload can benefit using parallel COGs.
Also having multiple HyperRAM devices on different pins would allow exclusive access without video taking any bandwidth on some of those banks and that would help here too. In a gaming application for example, 4 totally independent banks of HyperRAM could consume up to 48 pins (or 44/45 with shared resets) and that would still leave enough IO for video + audio + USB HID control.
Update: Of course in the extreme case above with 4 independent HyperRAM banks you probably wouldn't want to burn 4 extra driver COGs as well so some of those memories would probably need to be coupled directly to certain processing COGs for more dedicated access purposes in the overall pipeline. So maybe it could still be of benefit in some cases depending on how the code can be split up.
The new basic outer processing structure is in place and adaptive COG mailbox polling is working for minimising polling latency.
The faster back-to-back clocked individual byte/word/long writes now appear to be behaving correctly on the scope (no byte banging in the address phase anymore), with adaptive latency based on RWDS, and I am currently extending this to block-transfer writes and fills with variable as well as fixed block lengths. Some related stuff I added to control RWDS was discussed in this recent thread https://forums.parallax.com/discussion/comment/1490139/#Comment_1490139
My old slower but functioning HyperRAM routines are still present and I will probably use the old read code to validate the new write code works assuming I can still fit it in the COG along with the new variants.
I still have to do some more work for the zero latency writes, faster reads, external mem to external mem copy, and lists with this new driver structure, and there is still a little bit more configuration work too. But a lot of that is duplicating what has already been done with writes, with some adjustment. HyperFlash probably needs some zero latency bursts as well for decent write performance. Otherwise just doing single word flash writes is going to be really s..l..o..w. E.g. something < 50 kB/s or so if video is active. It would be nice to boost that to around 2-4MB/s or so using 256 byte bursts.
There's actually quite a lot to all this now. Because it supports multiple device banks using different control pins and breaking up bursts to account for maximum chip select time as well as video COG priority and programmable latencies, etc, there is a fair bit to try to manage. If it was just a single memory device with static control pins and latency things would be far simpler, though not quite as flexible. Once this driver is done I guess I could create a subset version with only a single bank/device supported. It would provide slightly less latency for small accesses though the longer bursts will not gain as much because the setup overhead diminishes relative to the overall service time as the transfer size increases.
If it fits, later I'd also like to add some optional SPI flash read routines so we could map one or more 16MB bank(s) to the SPI flash and this could allow background memory to memory copies from SPI flash directly into HyperRAM. The same thing would be possible to do with HyperFlash. SPI flash access, being slower in general and having Smartpin support for serial shifting etc could be a good candidate to run in HUB exec mode if it didn't fit into this driver.
To get a feeling of how things are starting to get more complex with the overall management of this shared Hyper memory driver, the code pretty much needs to deal with all the following aspects and there is probably even more to come as new issues are discovered or if further extensions get added.
1) Shared Hyper bus:
Multiple devices (up to 15) are possible on the same data bus and a minimum set of per device parameters to be resolved dynamically per access are:
- reset pin (optional, currently only used once at startup)
- clock pin
- chip select pin
- rwds pin
- programmable latency - programmed in device HW register and also configured in driver
- device size if > 16MB
2) Up to 15 different 16MB address banks:
- each bank is mapped to a device (many-to-one mapping is required if any device size > 16MB). It's probably also desirable for the address to wrap around within the same bank when burst transfers exceed the bank's address range limit. TBD.
- 128 request/bank EXECF configurable vectors allow general memory read & write accesses for different devices and a special configuration pathway for any global or device specific register configuration such as reconfiguring the latency, starting fill lists, HyperRAM zero latency register access etc.
3) Mailbox polling:
- Up to 7 COGs to poll (the memory driver COG is automatically excluded)
- priority vs RR (round-robin) polled COGs
- configurable polling priority order
- any nominated excluded COGs are not to be polled
- per COG notification preference (COGATN and/or mailbox only)
- a per COG burst size setting (this is static+common right now, but could be re-configured or dynamic one day perhaps for enforcing bandwidth fairness across RR COGs with a token/leaky bucket type of algorithm)
4) Operational transfer rates
- sysclk/1 or sysclk/2 read data transfer rates
- the maximum rate is normally auto-detected and the input delay computed via a lookup table, but it also needs to be overridable to force slower sysclk/2 read operation if a HW system or its normal operating conditions do not support the more aggressive read timing.
- different devices might allow sysclk/1 vs sysclk/2 read operation. Eg, HyperFlash clocks up to 166MHz while the HyperRAM only officially clocks up to 100MHz on the Parallax board though it can also be overclocked slightly, so that could affect the auto detection too.
- writes currently transfer data only at sysclk/2 rates so the clock transition can be centered in the stable middle of the data bit changes making the signal timing far more reliable.
5) Re-configuration pathway
- there is an optional COGATN notification to the memory driver COG once up and running, to trigger regeneration of polling loop code allowing COG clients to potentially come and go dynamically when the driver is informed of the change.
6) Per COG service state, eg. track request lists/fills/copies/bursts in progress.
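Point 3's priority-vs-round-robin polling could conceptually work like this (a sketch of the scheduling idea only, not the actual generated polling loop code): priority COGs get their mailboxes checked on every pass, while the round-robin set contributes one COG per pass:

```python
def poll_sequence(priority, rr, passes):
    """Mailbox check order over several polling passes: every priority COG
    is checked each pass, plus one round-robin COG per pass in rotation."""
    seq = []
    for n in range(passes):
        seq.extend(priority)          # priority COGs: checked every pass
        if rr:
            seq.append(rr[n % len(rr)])  # RR COGs: one per pass, rotating
    return seq
```

A video COG placed in the priority set thus sees bounded polling latency, while RR COGs share the remaining poll slots evenly.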
I would think that putting 16 HyperRAM chips on a board may not be that likely, as they are rather expensive, aren't they? I think most people would just use one chip. If there was a second chip, you could use another cog to drive it and thereby double the throughput. Or, is there any advantage to having multiple chips on the same bus? Does it make inter-chip transfers possible?
If someone wanted to split the busses for the different devices they can then do multiple transfers simultaneously to increase overall bandwidth albeit with more memory COG drivers.
I was wondering about direct inter chip transfers too yesterday, thinking you might be able to drive a different address out to each device if you could suspend clocks at the right time in the latency period. Maybe something is possible where one chip is reading and the other writing sharing a common clock phase over both clock outputs at the same time using a common data bus. It could be tricky to get it to work though not impossible with suitable clock control. It could double the transfer rate and completely avoid using the hub memory as the intermediate buffer.
these have a common clock, and 2 CS
CS1# Input Chip Select 1: Chip Select for the HyperFlash memory.
CS2# Input Chip Select 2: Chip Select for the HyperRAM memory.
Hehe, yes, with separate clocks that could be possible, on paper.