Someone may want to combine RAM and FLASH, and you can already buy RAM and FLASH dual-die parts.
These have a common clock and two chip selects:
CS1# Input Chip Select 1: Chip Select for the HyperFlash memory.
CS2# Input Chip Select 2: Chip Select for the HyperRAM memory.
Yeah I've heard about these combo parts. This driver design allows multiple banks to nominate the same clock pin, so these parts could also be supported just with two different chip select pins. All transfers complete in each bank with the clock left at the same low level before another bank can be accessed, so they should work out nicely.
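For illustration, such a combo part could be described to this driver as two banks sharing the same CLK (and data/RWDS) pins but nominating different CS pins. A minimal sketch with assumed pin numbers, using the same byte-packing idea as the pinconfig longs shown further below; this is not the driver's actual config API:

CON
  ' hypothetical pin assignments for a combo HyperFlash + HyperRAM part
  CLK_PIN  = 40                                ' shared clock
  RWDS_PIN = 42                                ' shared RWDS
  CS1_PIN  = 44                                ' HyperFlash die select
  CS2_PIN  = 45                                ' HyperRAM die select

  ' per-bank pin configs: byte0 = CS, byte1 = CLK, byte2 = RWDS
  FLASH_PINCFG = CS1_PIN | (CLK_PIN << 8) | (RWDS_PIN << 16)
  RAM_PINCFG   = CS2_PIN | (CLK_PIN << 8) | (RWDS_PIN << 16)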
Hehe, yes, with separate clocks that could be possible, on paper.
LOL. Yeah I'm not planning to try it anytime soon. I suspect it could get very tricky to get it right and very dependent on interleaving address phases/latency periods of both devices to avoid bus clashes. A lot of mucking about and it's back to some slower byte banging/clock control at the start of the transfers.
I could see a place for things like fast-splash screens, to give the illusion of fast boot, or the illusion of system speed .., where you fast-copy a screen-full of pre-prepared info, and 'fill in the details' later...
Definitely. With the video driver I have, you can flip to a new frame buffer start address in HyperRAM on the next frame, so there's no need to see a partially filled screen (i.e. tearing). Plus if the HyperRAM is already preloaded before a video COG even starts accessing HyperRAM, you'd have no competition for bandwidth. In that case you could load out of HyperFlash using very large burst sizes into hub and then write into HyperRAM (using smaller bursts of perhaps ~320 bytes at a time to limit CS low time). If you operate the P2 at 200MHz and run at sysclk/2 for both reads and writes, you could transfer say 32000B from HyperFlash into HUB in something like 321us and write this data back to HyperRAM in say 100*4us (which includes the overhead). Repeat 10 times and you've got yourself an 8bpp VGA screen buffer loaded in about 7.2ms, which is less than one frame time.
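Spelling that arithmetic out (assuming 200MHz and sysclk/2, i.e. 100 bytes per microsecond on the bus):

CON
  CHUNK_BYTES = 32_000                         ' one HyperFlash burst into hub
  READ_US     = CHUNK_BYTES / 100              ' ~320us per chunk at 100MB/s
  WRITE_US    = 100 * 4                        ' ~100 HyperRAM sub-bursts, ~4us each incl. overhead
  SCREEN_US   = 10 * (READ_US + WRITE_US)      ' 320KB total => 7_200us, about 7.2ms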
With the write burst code now in place and breaking up the large burst transfers into sub-4us chunks, the dead time between sub-bursts of the same original request (where there is no HyperBus activity) is about 1.2us on a 200MHz P2 in my driver. So it will probably be in the vicinity of 1us when scaled up to 252MHz.
Breaking up longer bursts is required for two reasons:
1) so that a long burst from a non-video COG can't significantly delay video, ensuring video data makes it back in time to be displayed
2) so that the chip select low time for a transfer to/from HyperRAM stays under 4us, avoiding refresh issues.
This result shows that for HyperRAM the best bus utilisation is generally not going to be much higher than about 80% (e.g. a full 4us burst followed by a ~1us gap gives 4/5 = 80%), unless the P2 can be clocked even higher than 252MHz or this 1us can be scaled back further. Significant code optimisations will probably be limited due to the processing work involved. The only thing I might do is add something in the code so that if the requesting COG of a broken up burst is already the highest priority COG, the code avoids the extra re-poll overhead between these bursts. That could help slightly with the video transfer portions. The other COGs do need to be suspended between bursts with all their state saved, and a new polling cycle has to restart so the higher priority COGs can have their access opportunities before the suspended burst transfer can be continued. This all takes extra P2 clock cycles and can't really be helped.
Here's a capture of the HyperRAM clock activity (in cyan) for a 768 byte burst write request (at sysclk/2 transfer rate) being broken up into 3 x 256 byte transfers on the HyperRAM bus. Each burst could go on a bit longer, maybe up to about 360 bytes or so for the 4us limit, and double this if we can get to sysclk/1 operation. However for non-video COGs we may want to keep this burst limit even smaller than 4us to keep the latency for video requests down at higher resolutions or colour depths. I've been using 256 bytes as a good round number for RR COGs. This should probably be customisable to optimise performance.
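For a feel of where that ~360 byte figure comes from, here's the budget arithmetic; the 0.4us of setup allowance inside the CS low window is an assumption, roughly in line with the gaps measured above:

CON
  P2_HZ        = 200_000_000
  BYTES_PER_US = P2_HZ / 2 / 1_000_000         ' sysclk/2 => 100 bytes per us
  CS_LOW_NS    = 4_000                         ' HyperRAM CS low limit
  OVERHEAD_NS  = 400                           ' assumed setup time spent inside CS low
  MAX_BURST    = (CS_LOW_NS - OVERHEAD_NS) * BYTES_PER_US / 1_000   ' => 360 bytes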
I've started looking at the external memory to external memory copy operation and I think I'd like to add an optional offset to the source and/or destination HyperRAM addresses, applied at the end of each burst. This would allow graphics memory to be copied more conveniently, and it could then both pack and unpack graphics regions to/from another contiguous part of memory. It would be byte granular, so it could easily support 8bpp, 16bpp, and 32bpp bitmapped modes.
These offsets would be applied after each hub transfer burst has completed and some number of iterations of this sequence, essentially a scan line count, would be programmed into the request data as well for it to repeat down the screen. That way a COG could just request a block of graphics memory to be copied and it would be automatically completed in the background. I think this should be possible as an extension to the basic block copy code. Large groups of these commands could be setup to automatically copy various graphics items to the screen for applications like GUI use etc. Probably won't be super fast with small items due to the ~1 microsecond overhead per command but it could help the COG by doing things at a slightly higher level.
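As a sketch of what such a request might carry (these field names are hypothetical, not the driver's actual mailbox format):

VAR
  long extAddr                                 ' external memory start address (plus bank)
  long hubAddr                                 ' hub buffer start address
  long lineBytes                               ' bytes transferred per scanline burst
  long hubPitch                                ' added to the hub address after each scanline
  long extPitch                                ' added to the external address after each scanline
  long lineCount                               ' scanlines remaining; request completes at zero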
Did a bit more on this new external memory driver scheme today and wrapped up the fill operation, plus some of the HyperRAM register read/write portions as extensions of the possible configuration commands. Right now I'm still mulling how to do the external memory copy operations. Rather than invent yet another duplication of the code path, I want to leverage the read and write burst transfers as the primitives for the copy operations, which seems to make more sense. Both for performance reasons and for minimising branching I've already replicated significant portions of the HyperRAM access code 3 times (for the reads, writes, and zero latency writes) and I don't want to expand it again further, because I'd then be likely to run out of code space in the COG.
It's been possible to combine fills with both single and burst writes by taking advantage of EXECF with skipping, and that has saved me quite a bit of COG/LUT space. I just need to figure out all the different possibilities for the copy, including the graphics scan line address offset options, and design a structure that can retain the current state over multiple bursts while other high priority COGs interleave their operations with the copy, while also re-using existing code blocks where possible. For high level commands like copy it's less important to save every cycle and minimise branching by unrolling code everywhere, because they transfer larger blocks anyway, so any additional overhead/complexity is divided over the entire operation. It's more important to keep the simpler operation code paths as short as possible while still having everything fit inside the COG.
Basically in the end I want this thing to be able to:
- transfer single byte/word/longs to/from external memory and HUB RAM (already coded)
- block fill bytes/words/long patterns to external memory (coded)
- transfer fixed or variable sized blocks to HUB from external mem (this is a fundamental primitive for copy, already fully coded)
- transfer fixed or variable sized blocks from HUB to external mem (ditto)
- copy a variable sized contiguous byte range between two external banks, or within the same bank (no bank overlap yet, the copy would just wraparound within the bank)
- copy a range of bytes from hub source to external memory, then offset the hub source address by some value and repeat this for some number of scanlines (e.g. saving some hub based frame buffer screen rect to contiguous external RAM)
- copy a range of bytes from external memory to hub memory, then offset the hub dest address by some value and repeat this for some number of scanlines (e.g. restoring a hub frame buffer screen rect from ext mem)
- copy a range of bytes from external memory to external memory, then offset both the external source address and the external destination address by some (different) values and repeat this for some number of scanlines. Supporting different src/dest offsets will be very good for restoring smaller portions of previously saved graphics windows or other source graphics objects when the screen is repainted, e.g. when windows are dragged around and newly revealed portions get redrawn
- other graphic copies to complete all possible src/dest cases (the full table is shown below)
- interwork all the above with the request list (some of this is coded)
- complete the transfer requests without interfering with high priority video or exceeding HyperRAM chip select limits, this means breaking up the larger bursts (both fill and primitive bursts are already being broken up in the code).
Here's the full list of copy cases I'm trying to now support, where bank1/bank2 are external memory banks, e.g. in HyperRAM (R/W) or HyperFlash (Read-only):
SRC     DEST    INC SRC ADDR    INC DEST ADDR   REPEATS         Notes
                per scanline    per scanline    by scanlines
=====================================================================
hub     bank1   No              No              No              Already coded
bank1   hub     No              No              No              Already coded
bank1   bank2   No              No              No              Standard linear copy
bank1   hub     Yes             No              Yes             Graphics copies
bank1   hub     No              Yes             Yes               "
bank1   hub     Yes             Yes             Yes               "
hub     bank1   Yes             No              Yes               "
hub     bank1   No              Yes             Yes               "
hub     bank1   Yes             Yes             Yes               "
bank1   bank2   Yes             No              Yes               "
bank1   bank2   No              Yes             Yes               "
bank1   bank2   Yes             Yes             Yes               "
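Every row of that table collapses into the same loop shape. Here's a minimal Spin2-style sketch of the per-request behaviour (method and parameter names are hypothetical, and the real driver interleaves other COGs' requests between the sub-bursts):

PRI do_copy(src, dst, lineBytes, srcPitch, dstPitch, lines, incSrc, incDst)
  repeat lines
    copy_burst(src, dst, lineBytes)            ' itself broken into <4us sub-bursts
    if incSrc
      src += srcPitch                          ' advance to the next source scanline
    if incDst
      dst += dstPitch                          ' advance to the next destination scanline

PRI copy_burst(src, dst, count)
  ' placeholder for the underlying read-to-hub / write-from-hub primitives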
I don't know if it's the case yet, but are you intending to support direct byte-granular transfer operations between external devices, without having to pass through P2 internal memory?
Not initially, but perhaps later if there is room in the COG for this direct copy it could be investigated for its potential 2-3x performance boost. It requires a significantly different code path for the transfer portion as well as different copy control logic that avoids the streamer entirely during the data phase so I'll try to keep it in mind during the coding. I expect much of the time there would only be a single bank involved with hub RAM, but some applications could benefit if they copy a lot of data between different Hyper devices sharing the same bus.
The other thing I haven't covered is copying between devices on two different buses. Given that needs two driver COGs as well as multiple mailbox groups etc, I think the only way to approach it is to have the requesting COG manage the copy itself through an API, using some HUB RAM as the intermediary and issuing the multiple reads/writes required to complete the transfer. It's a bit hard to fathom otherwise right now.
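A sketch of how such a client-side helper could look, staging through a hub buffer; the object and method names here are hypothetical stand-ins for two driver instances on different buses:

CON
  BUF_SIZE = 1024                              ' hub staging buffer size
OBJ
  busA : "hyper_driver"                        ' hypothetical driver object, bus A
  busB : "hyper_driver"                        ' second instance, bus B
VAR
  byte buf[BUF_SIZE]

PUB copy_across_buses(srcAddr, dstAddr, total) | n
  repeat while total > 0
    n := total <# BUF_SIZE                     ' limit each pass to the buffer size
    busA.read(srcAddr, @buf, n)                ' driver COG A: external -> hub
    busB.write(dstAddr, @buf, n)               ' driver COG B: hub -> external
    srcAddr += n
    dstAddr += n
    total -= n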
I think I can shave off some inter sub-burst cycles for the video transfers. The burst read code I have now will lock the bus for video COGs and generate multiple sub-bursts to complete their full transfers, without yielding to the poller between sub-bursts. Even though things are still not fully complete, by counting the instructions outside of when read transfer clocks are active I get the following ballpark numbers (which are still subject to change, but not by much):
Transfer setup instructions while !CS is high between sub-bursts : 16
Transfer setup instructions while !CS is low but no clock is active : 14
So for a HyperRAM device with a 4us maximum !CS low time and a 200MHz P2 (100 MIPS) we'd have:
4 - 0.14 = 3.86 us of clock active time within each CS low window, plus 0.16us of CS high time between sub-bursts.
This means the HyperRAM bus can be utilised for 3.86us out of every 4.16us, or over 92%, which is quite a lot better than I was seeing with the 1.2us sub-burst transfer gaps above. Of course this is only in cases where the COG can lock the bus for the entire transfer, such as a video COG. Round Robin COGs won't have this luxury, unless there are no priority COGs configured and you want this behaviour enabled.
Also the rest of the driver is finally coming together a lot better now. I'd been really bogged down with these new features I want such as lists and graphics copies and bank-to-bank transfers and this was slowing my thoughts about how it all goes together. I also reached a limit in the size of the COG/LUT RAM and that was a good thing as it forced me to go back and streamline the code and use EXECF/SKIPF more extensively which leverages code sharing and now makes it more likely to fit everything in. For example here's a snippet of some read setup code for all types of transfers.
' a b c d e f
' B W L B R L (a) byte
' Y O O U E O (b) word
' T R N R S C (c) long
' E D G S U K (d) new burst
' T M E (e) resumed sub-burst
' E D (f) locked sub-burst
r_single wrfast #0, ptrb ' a b c setup streamer hub address
mov c, #1 ' a | | read a single byte
mov c, #2 ' | b | read a single word
mov c, #4 ' | | c read a single long
wrlong #0, ptrb ' a b | clear out upper bits of byte/word
push complete_read_addr ' a b c reads will complete after this
r_burst tjz count, #noread ' | | | d check for any bytes to send
r_burst_resume_lut setnib bankparams, request, #0 ' | | | d e get bank parameter LUT address
rdlut b, bankparams ' | | | d e get bank limit/mask
bmask mask, b ' | | | d e build mask for addr
shr b, clockdiv ' | | | d e scale burst size based on freq
fle limit, b ' | | | d e apply any per bank limit to cog limit
r_locked_burst mov c, count ' | | | d e f get count of bytes left to read
mov addrhi, addr1 ' a b c d e f setup address to read from
wrfast #0, hubdata ' | | | d e f setup streamer hub addr
fle c, limit wc ' | | | d e f enforce the burst limit
setword xrecv, c, #0 ' a b c d e f setup streamer count for burst
if_c push continue_read_addr ' | | | d e f burst read will continue
if_nc push complete_read_addr ' | | | d e f burst read will complete
setnib deviceaddr, request, #0 ' a b c d e | get the bank's pin config address
rdlut pinconfig, deviceaddr ' a b c d e | get the pin config for this bank
getbyte cspin, pinconfig, #0 ' a b c d e | byte 0 holds CS pin
getbyte clkpin, pinconfig, #1 ' a b c d e | byte 1 holds CLK pin
getbyte rwdspin, pinconfig, #2 ' a b c d e | byte 2 holds RWDS pin
getbyte latency, pinconfig, #3 ' a b c d e f byte 3 holds latency clock edges
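For anyone unfamiliar with the pattern: EXECF branches to the address in bits 9:0 of its operand and loads bits 31:10 as a skip mask, with pattern bit 0 applying to the first instruction at the branch target and each 1 bit suppressing one instruction. So the single byte read path (column a) is just a vector whose mask skips the word/long lines. An illustrative fragment only, showing the first few pattern bits rather than the driver's real vectors:

DAT
' skip pattern: bit 0 = first instruction at the target, 1 = skip that instruction
rd_byte_vec     long    r_single | (%1100 << 10) ' run wrfast + 'mov c,#1', skip the word/long movs
...
                execf   rd_byte_vec              ' vector into r_single with skipping active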
It looks like I might not be able to get the SPI flash integrated into this same HyperRAM driver as I had once envisaged might somehow be possible. I was sort of hoping we might be able to do transfers directly from SPI flash to HyperRAM etc inside this driver (including graphics transfers). But for now if anyone needs to do that they'll probably have to manage it themselves via the HUB RAM as their intermediate transfer buffer.
Initially this driver will only be able to manage HyperRAM and HyperFlash devices on a common bus. If I can I'll try to add an external driver expansion option (via hub-exec), but right now I don't think it could work with graphics transfers and bank to bank copies because part of that code is coupled closely to HyperRAM/HyperFlash devices and requires running in the streamer's overlap time etc. But I'll look at what else might be required there if I can eventually make it work somehow. I'm thinking it probably doesn't make a lot of sense in video applications to transfer via SPI because of the lower speeds and the limited number of transfers possible before you need to yield to a video COG. The setup overhead would be very significant in this case and the effective transfer bandwidth would get hammered hard. In non-video applications where yielding is not required it might be more useful for block transfers etc at application setup time.
I'm now getting close to being feature complete in the code (still fully untested) and it is currently consuming 463 COGRAM longs, and 492 LUTRAM longs. A rough breakdown is this:
COGRAM:
- Mailboxes: 24 longs
- Per COG handler code: 96 longs unrolled
- Service jump table for 16 banks: 128 longs - if needed this could be shrunk down to 8 longs per device type, at the cost of a handful of extra instructions per request, maybe 10 more cycles?
- All polling + config code + error handling ~ 107 longs
- Data / other state / streamer commands / EXECF vectors etc : ~ 100 longs
LUTRAM:
- Per bank state: 32 longs (2 per bank)
- Per COG list execution state: 80 longs (10 per COG)
- All Hyper device transfer code (Reads, Writes & zero latency accesses) + copy/fill handling: 380 longs
I still need to clean up some config commands and add some more error handling, but I'm confident what I want to do will ultimately fit (if not I'll make it fit!).
It's not in the polling loop; it's the processing time after the poller has found something to do, so the extra penalty is per request. I'll do it if I find I need to free up space. Right now I think I can still get it all to (just) fit with this 128 entry service table, which is what allows the many-bank-to-one-device mapping, giving us the address range flexibility per device. The other way is to store a smaller lookup table base offset in a per-bank LUT register and look that up first, instead of jumping directly via the 8 bit request value that already nicely falls into the 128-255 COG RAM range.
Perhaps the overhead is not quite so bad as 5 instructions (10 clocks) for shrinking the execf table size down.
My current per COG code with a 128 entry service table currently does this to branch to the service requested....
getbyte request, addr1, #3 'get request + bank info
altd request, #0 'lookup jump vector service table
execf request-0 'jump to service
The code to possibly shrink this table down to just 16 longs (8 HyperRAM vectors + 8 HyperFlash vectors), plus a per bank base vector address of 16 further longs could use something like this:
getnib bank, addr1, #6 'get bank from request
getnib request, addr1, #7 'extract upper request nibble values (8-15)
alts bank, #vectorbase 'get per bank vector table address
altd request, #0-0 'add to base and
execf request-0 'jump to service for bank
So I could then possibly store the device's table base in the COG RAM (16 longs) and burn 4 extra clocks to save myself 128 - 16 - 16 - (2x8 COGS) = 80 COG RAM longs if required at some point for the two supported device types. This was for singular requests, the complex requests may introduce some further penalty. TBD.
I suppose you could even make a bytecode executor using XBYTE that had just several instructions, including a loop, so that you could actually program the HyperRAM server.
I wonder if we will need a per HyperRAM/HyperFlash device delay parameter to compensate for the different path delays each device's results take getting back to the P2's data bus input pins, or whether a single global delay value (which is what I currently use) is all that is required?
I could make the input delay a per device parameter instead of a single common value configured once at COG init time, but it will probably add 8 more clocks per read transaction to set it up which would be nice to avoid.
e.g. it would need something like this in the code path:
testb bankparams, #7 wz ' flag to indicate to use a registered data bus input
if_nz wrpin #0, datapins
if_z wrpin regd, datapins ' regd = #%100_000_000_00_00000_0
getbyte delay, bankparams, #1
So do we think that a multiple device setup will need independent input timing compensation? I'm now sort of thinking we might...
In case the above wasn't clear, this is the delay we require between starting the clock output and the streamer command to read back the result into P2 memory. It varies with P2 clock frequency (and probably temperature).
E.g.
wypin clks, clkpin 'setup number of transfer clocks
waitx delay 'tuning delay
xinit xrecv, #0 'start data transfer
Truth is: newer Octas seem to include test-pattern capabilities in their specs, in order to verify data path and timing integrity between the controller (P2) and device(s) on a per-device basis.
Even Hypers have seen some changes in order to agree with the HyperBus 2 specs.
So, despite the extra code needed, IMHO it's better to be able to do it the same way for the old devices as for the new ones.
Yeah I've heard about that inbuilt test pattern for OctaRAMs. Another benefit of making it variable is that it gives us a way to control it dynamically after startup, in case that's required for future experimentation etc. Right now it uses a static table (thanks go to ozpropdev who figured it out) to compute the delay + registered input pin settings based on P2 clock frequency at driver COG init time.
In theory the driver (or client) could also write out its own test patterns at startup time into HyperRAM (which would obviously corrupt it), and then read it back with various values of delay+registered setting which could then help optimise it for the system conditions including the current temperature.
Perhaps one day something else could track temperature and then make slight adjustments if required before the data is corrupted (probably gets difficult unless you've already characterized all of this in advance).
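A minimal sketch of that self-test idea, assuming hypothetical driver methods (write/read/set_input_delay) and a sacrificial scratch area:

CON
  SCRATCH_ADDR = 0                             ' assumed sacrificial area in HyperRAM
  MAX_DELAY    = 15
OBJ
  hyper : "hyper_driver"                       ' hypothetical driver object
VAR
  long pattern[64]
  long readback[64]

PUB find_input_delay() : best | d, i, ok
  repeat i from 0 to 63
    pattern[i] := i ^ $A5A5_A5A5               ' simple known pattern
  hyper.write(SCRATCH_ADDR, @pattern, 256)     ' corrupts the scratch area only
  repeat d from 0 to MAX_DELAY
    hyper.set_input_delay(d)                   ' hypothetical config command
    hyper.read(SCRATCH_ADDR, @readback, 256)
    ok := true
    repeat i from 0 to 63
      if readback[i] <> pattern[i]
        ok := false
        quit
    if ok
      return d                                 ' first working delay; centring the valid window is better
  return -1                                    ' no setting worked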
I sincerely believe that, by ensuring the CA phase really agrees with the Hyper timing specs, any eventual corruption could be limited to a single row, perhaps even to just a few words of the affected row, inside the main memory array address range of each die (in the case of MCPs).
Since we can't really afford inter-die "jumps" (bursts would just wrap around within die limits), sparing, say, the first or last row of each die as a "dirty" or scratchpad area to run the tests in could efficiently deal with any tests we could imagine.
That way, any eventual out-of-bounds corruption would have to be credited to somehow "shrunken", or even skipped, HyperCK cycles that couldn't be understood as valid clock cycles by the Hyper, thus messing with its logic while trying to keep pace with the HyperBus controller (P2).
Perhaps using HyperCK = sysclk/4 during the CA phase and the first (mandatory) latency count, and only increasing HyperCK to sysclk/2 afterwards, during the second latency count (if needed for the device type) and beyond, throughout the whole data-transfer phase, could be enough to keep Hyper accesses steady over time, despite any reasonable temperature changes.
Some evidence that we are not the only ones facing problems dealing with Hypers (and now Octas) can be extracted from two well known facts:
- the mere existence of the DCARS variant (using two "extra" clocks, PSC and PSC#) since the Hypers' early days always seemed strange to my eyes (and nose; a kind of rotten cheese smell);
- and now, the jewel in the crown: the (re-)inception (or, eventually, resurrection from the graves of unknown early stages) of the "tinkerbell" twins, HyperCK/HyperCK#, now available to be used even with 3V devices. What is being sold as an improvement can also be a way to sweep a bit of dirt under the rug...
One must be deaf not to hear that kind of dissonance, always hitting the same spot: timing, timing, timing... Ouch!
Yeah, address wrapping in my driver happens for each bank, and banks can be mapped per device in a shared device package or one with multiple memory banks. I configure an address mask per bank (e.g. 16MB=$ffffff, 32MB=$1ffffff etc) which I apply to the external address after each address increment stage at the end of each sub-burst transferred, and copy/fill operations will also wrap this way. An outer SPIN client layer could always break apart fills/copies to work linearly across devices should it choose to, but this is not something the PASM driver will figure out for you.
The only issue I see is that with 128MB devices, masking this way could potentially trigger a control access, given I've reserved bank 15 for control/configuration commands. This can basically happen when crossing the 240MB boundary, which could then access bank 15 after I apply the $7ffffff mask. Right now I am not handling that case specifically, but I might if there is a fast way to check it. It can be avoided by not putting any 128MB devices in the 128-256MB address range, but mapping them to 0-128MB instead and keeping the upper address range for smaller devices that avoid the 240-256MB area. I'm not sure there are any single bank 1Gbit Hyper memory devices yet, but they'll probably arrive one day, with flash first.
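Concretely, with 16MB banks the bank number lives in the top nibble of the 28-bit external address, so any request starting at or above 240MB carries bank 15. A tiny worked example of the aliasing (the exact request encoding is the driver's own; this just shows the arithmetic):

PUB demo() | addr, bank
  addr := $F12_3456                            ' a start address above the 240MB boundary
  bank := addr >> 24                           ' top nibble = 15 -> reserved control bank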
Whicker told us, in a post, that there are even four-die stacked devices; for now just the 1.8V ones, but, as with some kinds of food, what doesn't kill me inflates my belly...
At some point in the future you'll face the need to craft a lot of tables, and ways to manage their use efficiently, or you'll risk spending an enormous part of the whole driver operation trying to figure out which way each device needs to be configured and used during accesses.
You could be given a Medal of Merit for being the first to figure out how some kind of AI, even a mild one, can be crafted and used with the P2, just because you need it like fresh air, in order to be able to take a deep breath!
Actually, we may be spared from the issue I identified in the prior post. Looking at my code, I think I only use the initial address from the first request to determine the operation, and never after the address is incremented, so in theory as long as you don't start an access above 240MB in the top 128MB bank you'll be okay even if things wrap. In fact this lets you access the upper 16MB if you do read/write bursts to/from that area starting below 240MB. If you ever try to start an access above 240MB it will look like a driver configuration request.
Yes things can get complicated quickly. This driver is probably not going to be universal, but something specific to some range of HyperRAM/HyperFlash devices. The parameters that can be configured per bank/device and/or used as part of the data transfer logic are:
- size
- latency
- CLK, !CS, RWDS pins - (a reset pin also optionally defined but used independently).
- burst size : e.g. HyperFlash would be up to $ffff (due to the streamer limit), while the HyperRAM burst is computed to limit the chip select low time to 4us (including overheads); further per COG limits can apply to this burst size to allow video priority, latency bounding etc.
- type (RAM/Flash) - this defines the 8 operations possible, which R/W code to use and what access sizes are allowed (e.g. flash is via 16 bit data access only, not 8 or 32 bits)
- delay+registered input flag per bank is being considered
- internal registers of each device can be controlled
The transfer speed is not a per bank setting, but is global. Writes are done at (sysclk/2) bytes/second and reads are either (sysclk/2) or (sysclk/1) bytes/second (selectable).
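Tying this back to the earlier LUT budget of 2 longs per bank, one plausible packing is a pinconfig long matching the byte layout in the read-setup snippet plus a params long for the size mask and burst limit. The params layout is a guess; only the pinconfig byte order matches the posted code:

DAT
' per bank state, 2 longs x 16 banks (hypothetical packing)
bank0_pincfg  long  44 | (40 << 8) | (42 << 16) | (4 << 24)  ' CS=44, CLK=40, RWDS=42, latency=4
bank0_params  long  (360 << 8) | 23            ' upper bits: burst limit; low bits: bmask size (23 -> $ffffff, 16MB)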
After tidying up the driver code on and off for the last week I think it's sort of ready to start testing most of what I have now. This driver code is currently consuming 470 longs in COGRAM, and 490 longs of LUT RAM, this is without the service vector table size reduction I discussed earlier that might free up 80 more COGRAM longs if I ever need it (which I don't think I will right now).
It has been taking me a lot longer to sort out than I'd hoped, as all the skipping about in the code to share pathways complicates things and definitely makes it harder to track what is going on. For optimising different branches I use various combinations of calls/jumps/execf/skipfs/"_ret_ push d" etc and don't need to keep the stack intact, which makes tracking things a bit harder. Compared to my video work, which gives you instant feedback/gratification as you test it, I'll have to say this was not a fun development, and I've also done things in dribs and drabs to avoid losing interest, which has probably slowed me more than normal. Once I finally start testing, get some results, and see it all working again with my video driver, I think I'll be able to speed up progress further.
From my POV, that part of your driver is one of the best decisions you've made!
Writes need to be as solid as depleted uranium, so you can be sure that whatever you put in there is kept in the right place, with the right values.
Faster read ops can be attempted, adjusted, and reconfigured, even on the fly, but you need to trust what certain basic things are and where they live, mainly to be able to check for any viable "test patterns".
Had some more time to get this driver tested. I finally have a stable environment with all the service code needed for testing, and so far I've tried burst reads/writes/fills and individual byte/word/long accesses. My driver code paths are no longer hanging (they were at first, until I resolved some bugs) and my logic analyser shows the HyperRAM bus clock transitioning the way I expect, which is good.
I am still proceeding with testing request lists and graphics copy logic next and when I'm satisfied with it I'll speed up the P2 to full speed and try out fitting the P2-EVAL with the real Hyper module again.
By the way, a common mistake I seem to make is to write this
test x, #bit wz
instead of
testb x, #bit wz
This drives me crazy - and I type it a lot for some reason. Some hangover from the P1 I guess. (The catch: test ANDs x with the immediate as a mask and sets Z if the result is zero, while testb copies bit number #bit into Z, so both the bit selection and the Z polarity differ.)
Here's an output grab of the clock pin for a single long read from an odd byte address with 4 latency clocks. Unlike the write operations I can't quite get it down to back to back clocks between address and data phase because of bus turnaround instructions and waiting the right amount of input delay and needing to potentially switch the data rate to sysclk/1 in between (which I don't want to do until the first burst finishes), but it's still pretty fast as you only use up around two extra memory clock cycles for doing this. For long bursts it will hardly impact things much compared to all the other overheads, but if I can tighten it any further I will try to do that too.
Update:
By the way, the COGRAM usage is up to 489 longs and the LUT RAM is now 491 longs. It's getting tight but it all still fits - plus I do also have a little bit of extra debug stuff in there which I can remove at some point, and I can rebalance some short code functions between these RAMs if required. I don't think I need to add anything else to this code now apart from any bug fixes, which should be small.