Looks like your cepin is set up 3 bits too high, above pin 8, and is immediately clobbering the prior OUTH instruction.
Instead of this:
srcmd_add long %1000<<CEPIN 'sr_addr + 1
maybe try this:
srcmd_add long 1<<CEPIN 'sr_addr + 1
EDIT: On second thoughts, scrap that. You seem to have your address bus shifted up some bits. I have mine starting at bit zero. However the first mov outa, sr_addr doesn't keep pin 8 set to 1 which I think is the issue.
@rogloh said:
However the first mov outa, sr_addr doesn't keep pin 8 set to 1 which I think is the issue.
Yep, that was it. And I'd only just moved it into that position in the source too. It should have rung an alarm much earlier. These things happen. PS: I've been trying to optimise again.
PPS: Sigh, now I've got an ugly NOP sitting there padding the XINIT.
PPPS: I do have the smartpin sequencing down to four instructions now. That makes me feel better.
@evanh said:
I'd use a preprogrammed PAL/CPLD chip as the external counter. Then it can be placed on the 8-bit databus as well. With this arrangement the entire address would be loaded into it a byte at a time. It does give you much lower access latency and avoids the refresh complications of PS(D)RAMs.
Could get creative with features like single byte sized address updates packed into the CPLD.
If you want a 'latch with party tricks', then a CPLD would do (I wrote some with 22V10s and XC9500s, ages ago) but that's overkill, really. A 74xx373 8-bit latch (or, more likely, two or three of them) could each latch off their share of the address from the data bus.
The tradeoff (it's always something!) is that you need more bus cycles to load each address latch, and another pin signal for each latch, but the whole thing could come in at 13 pins.
(DB0-7, ALE's 0,1,2, /[ram]WR, /[ram]RD. /CS can be grounded if you're only using one SRAM chip).
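To sanity-check that pin budget, here's a quick tally as a hypothetical Python sketch (the function and parameter names are mine, not from any driver):

```python
# Rough pin tally for the latched-address SRAM scheme sketched above.
# Function and parameter names are illustrative only.
def latched_sram_pins(data_width=8, address_bits=24, latch_width=8,
                      control_pins=2):   # /WR and /RD; /CS grounded
    num_latches = -(-address_bits // latch_width)  # ceil: one ALE per latch
    return data_width + num_latches + control_pins

# 8 data + 3 ALEs + /WR + /RD = 13 pins, matching the tally above
```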
In the application I currently have in mind (I've got several more for the P2 - Those smart-pins!) my external RAM access is going to be very sequential - Write a lot of data while "offline" (don't care about speed), then reset the address counter to zero, and start reading "online" (fast enough to feed the algorithm(s).) Random access doesn't really come into it.
Still, I imagine for most it will, so lots of ALE signals to pull address bits off the data bus are not a bad idea at all!
Have fun, S.
ETA a bit more: Since the P2 does pretty much everything in 32 bits, and at the moment we're assuming an eight-bit-wide SRAM, using a 16V8 as a six-bit latch with a two-bit counter on the low bits would cheerfully save quite a few bus cycles when reading or writing long values. Latches with party tricks...
@rogloh said:
However the first mov outa, sr_addr doesn't keep pin 8 set to 1 which I think is the issue.
Yep, that was it. And I'd only just moved it into that position in the source too. It should have rung an alarm much earlier. These things happen. PS: I've been trying to optimise again.
PPS: Sigh, now I've got an ugly NOP sitting there padding the XINIT.
LOL, "Premature optimization is the root of all evil", I was once told by a colleague.
A problem with ALE-type latches is that once you wrap beyond the end of the 8-bit address group you've wired in directly (for example), you need to update the latch again (or pulse a counter perhaps). This means the burst transfer has to stop right at this point and do something else. It's okay if you've always coded for this specific memory page size/architecture, but to keep it generic you have to compute where to stop the transfer and pulse a clock or load the next latch. It wouldn't (easily) suit a general-purpose software driver designed to stream arbitrarily large bursts at any address, only one specifically coded for that HW implementation.
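That "compute where to stop the transfer" step can be sketched generically. This is a hypothetical Python model (the function name and page_size parameter are mine, not from the driver):

```python
# Hypothetical sketch: chop a burst so it never crosses a latched "page"
# boundary, i.e. the point where the external latch/counter must be
# reloaded. Yields (start_address, length) sub-bursts.
def split_burst(addr, count, page_size):
    while count:
        room = page_size - (addr % page_size)  # bytes left in this page
        n = min(count, room)
        yield addr, n
        addr += n
        count -= n

# a 300-byte burst from address 200 with 256-byte pages becomes
# (200, 56) then (256, 244)
```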
Or you need latches with better party tricks - Imagine instead each latch is a loadable counter (74xx163, four bits a pop?*) and their 'count' pin is hooked up to your 'burst read', so you start by loading up the start address and then the "latches" count the address up with every read/write pulse.
Understood, we're getting tricky here, saving pins by 'being clever in external hardware' and there are other tradeoffs as well. Seems to me, though, that a driver could accept 'How big is my page?' as an argument, and use that to 'stop and reload address'. Dunno myself - I'm not a driver writer.
Have fun, S.
*A 74F283 might be better, because otherwise the carry out from each counter could lollygag its way up for some time before arriving at the top bits
Read code is complete for full feature integration into my driver (untested, delay code might need adjustment). Gonna work on the write code then test with simulated SRAM timing until some board is available.
This code does not use special pin control optimizations but allows control pins to be put anywhere, and the data bus just needs to be on any 8-pin group boundary. It also allows multiple banks with different CS pins. The only constraint is that the address bus group needs to be located with its base pin (A0) on either P0 (port A) or P32 (port B); this makes the address setup/wrapping/masking simple and doesn't require extra shifting.
' SRAM READS
' a b c d e f
' B W L B R L (a) byte read
' Y O O U E O (b) word read
' T R N R S C (c) long read
' E D G S U K (d) new burst read
' T M E (e) resumed sub-burst
' E D (f) locked sub-burst
r_single mov resume, complete_rmw ' a b c set resume address
wrfast xfreq1, ptrb ' a b c setup streamer hub address for singles
wrlong #0, ptrb ' a b | clear unwanted bits of long result
andn addr1, #1 ' | b | align word address to prevent address wrap
andn addr1, #3 ' | | c align long address to prevent address wrap
r_burst mov orighubsize, count ' | | | d preserve the original transfer count
tjz count, #noread_lut ' | | | d check for any bytes to send
r_resume_burst getnib b, request, #0 ' a b c d e get bank parameter LUT address
rdlut b, b ' a b c d e get bank limit/mask
bmask mask, b ' a b c d e build mask for addr
getbyte delay, b, #1 ' a b c d e get input delay of bank + flags
p0 shr b, #17 ' | | | d e scale burst size based on bus rate
fle limit, b ' | | | d e apply any per bank limit to cog limit
r_locked_burst wrfast xfreq1, hubdata ' | | | d e f setup streamer hub address for bursts
mov c, count ' | | | d e f get count of bytes left to read
fle c, limit wc ' | | | d e f enforce the burst limit
mov c, #1 wc ' a | | | | | set transfer length of a byte
mov c, #2 wc ' | b | | | | set transfer length of a word
mov c, #4 wc ' | | c | | | set transfer length of a long
if_c mov resume, continue_read ' | | | d e f burst read will continue
if_nc mov resume, complete_rw ' | | | d e f burst/single read will complete
shr delay, #5 wc ' a b c d e | prep delay and test for registered inputs
bitnc regdatabus, #16 ' a b c d e | setup if data bus is registered or not
setnib deviceaddr, request, #0 ' a b c d e | get the bank's pin config address
skipf #%111111000 ' a b c | | | skip the burst transfer test for single reads
rdlut pinconfig, deviceaddr ' a b c d e | get the pin config for this bank
getbyte cspin, pinconfig, #0 ' a b c d e | byte 0 holds CS pin
getbyte oepin, pinconfig, #1 ' a b c d e | byte 1 holds OE pin
mov d, addr1 ' | | | d e f get start address
and d, mask ' | | | d e f only keep address bits
subr d, mask ' | | | d e f figure out how many bytes remain before the last
add d, #1 ' | | | d e f allow the last address as well
fle c, d wc ' | | | d e f compare this size to our transfer size and limit it
if_c mov resume, continue_read ' | | | d e f if required we will continue with a sub-burst again
read_common setword xrecvdata, c, #0 'adjust the byte transfer clocks needed in streamer
setxfrq xfreq2 'setup streamer frequency (sysclk/2)
setq mask 'setup address pin mask
patchport0 muxq outa, addr1 'set starting address of transfer
wrpin regdatabus, datapins 'setup if pins registered or not
drvl cspin 'activate CS
drvl oepin 'and OE
xinit xrecvdata, #0 'begin streamer
waitx delay 'apply delay
rep #1, c 'repeat for number of bytes to transfer
patchport1 add outa, #1 'increment address
call resume 'see what to do next for list processing, RMW, gfx copy etc
drvh oepin 'drive OE pin high
_ret_ drvh cspin 'and release active CS
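The sub-burst clamping in the burst path above (the and/subr/add/fle sequence on d) is compact enough to deserve a plain model. Here's a hedged Python equivalent, with names of my own choosing:

```python
# Model of the address-wrap clamp from the read path above:
#   and d, mask / subr d, mask / add d, #1 / fle c, d
# mask is the bank size minus 1 (power-of-two bank sizes assumed).
def limit_to_bank(addr, count, mask):
    d = addr & mask       # and d, mask  : keep in-bank address bits
    d = mask - d          # subr d, mask : bytes before the last address
    d += 1                # add d, #1    : allow the last address as well
    return min(count, d)  # fle c, d     : clamp the transfer size

# a 100-byte read at $FFF0 in a 64KB bank (mask $FFFF) clamps to 16
# bytes, stopping exactly at the bank's end
```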
I've optimised the routine for single byte write down to six instructions. Just three for the smartpin.
sram_wrbyte
mov outa, sr_addr 'preset address bus (This will clobber any preceding OUTH actions!)
wxpin sp_fast, #WEPIN 'sysclock/2 pulse period (registered pin makes it appear two clocks after instruction)
wypin #1, #WEPIN 'start the WE smartpin pulses on next period
setbyte outa, pa, #0 'write data byte to data bus
outl #CEPIN 'release SRAM
_ret_ wxpin #1, #WEPIN 're-prime the smartpin
The two WXPINs sort of provide an XINIT effect around the WYPIN.
The details are a little more subtle:
The last WXPIN puts the smartpin's "base period" timer (the pulse period) into a continuous reset on every tick of sysclock. This then allows an instant new period to be set without the otherwise obligatory DIRL/DIRH pair.
In routines that use the streamer there is a third WXPIN needed, but here we can get away with one less by virtue of the proximity of the SETBYTE. The WE pulse from the WYPIN can be delayed just enough to align with the data out without extra compensation. This is partly achieved by making the CE pin registered while the data pins are unregistered.
@evanh said:
I've optimised the routine for single byte write down to six instructions. Just three for the smartpin.
sram_wrbyte
mov outa, sr_addr 'preset address bus (This will clobber any preceding OUTH actions!)
wxpin sp_fast, #WEPIN 'sysclock/2 pulse period (registered pin makes it appear two clocks after instruction)
wypin #1, #WEPIN 'start the WE smartpin pulses on next period
setbyte outa, pa, #0 'write data byte to data bus
outl #CEPIN 'release SRAM
_ret_ wxpin #1, #WEPIN 're-prime the smartpin
In the case of single reads/writes it makes sense to pull them out as special cases and byte-bang them, especially writes. I'll probably optimize my own code later with this in mind. No need to use the streamer for single writes/fills, only bursts from HUB. Single byte reads are probably the same. Word and long reads might not benefit so much there due to latency, TBD.
The problem with that is it adds extra handling in all the other routines that still use the smartpin. And it needs a WRPIN in the single byte routine too.
EDIT: Oh, are you looking at the SETBYTE replacing the XINIT? I thought you meant bit-bashing the control/clock pins.
You've got something bizarre going on with that WYPIN c, clkpin and the WAITX. I don't see any support around it for starting in phase with the data. It must be a large period. If so, then you're fluking the period timer reset without explicitly doing so.
Eg: Here's my latest linear write using the dynamic period changes to replace DIRL/H. It still requires three WXPINs for managing phase timing.
sram_wrbytes
rdfast fifo_nb, hub_addr 'start the FIFO
mov outa, sr_addr 'preset address bus, includes CE (This will clobber any preceding OUTH actions!)
wxpin #8, #WEPIN 'tuned compensation delay, stretches the first period
wxpin sp_fast, #WEPIN 'sysclock/2 pulse period (registered pin makes it appear two ticks after instruction)
wypin length, #WEPIN 'start the WE smartpin pulses on next period
setword stm_rf8, length, #0 'count of streamer transfer cycles
xinit stm_rf8, #0 'start the data, takes two ticks to start
rep @.rloop, length 'cycle through the addresses
add outa, srcmd_add 'sr_addr + 1
.rloop
outl #CEPIN 'release SRAM
_ret_ wxpin #1, #WEPIN 're-prime the smartpin
@evanh said:
You've got something bizarre going on with that WYPIN c, clkpin and the WAITX. I don't see any support around it for starting in phase with the data. It must be a large period. If so, then you're fluking the period timer reset without explicitly doing so.
I'm yet to test and it might need some alterations... the general approach worked for HyperRAM reg writes though IIRC.
Update: Just looked at some of my HyperRAM write setup code... I do see some extra instructions beforehand to reset the clock smartpin and set up its clock rate, and we would still need something to do that sort of work too, so the earlier write snippet is certainly incomplete... it was mainly posted above to show the idea of not using the streamer for simple writes/fills, only for bursts.
drvl cspin 'active chip select
drvl datapins 'enable the DATA bus
fltl clkpin 'disable Smartpin clock output mode
wxpin #2, clkpin 'configure for 2 clocks between transitions
drvh clkpin 'enable Smartpin
setxfrq xfreq2 'setup streamer frequency (sysclk/2)
waitx clkdelay 'odd delay shifts clock phase from data
xinit ximm4, addrhi 'send 4 bytes of addrhi data
Write burst timing seems to be good. I am copying a 10 byte pattern in HUB RAM that is $00ff00ff etc, to address 0 in a simulated SRAM with my driver. Only data bit 0 is shown and address bits 0-3.
Okay, cool, the DIRL/H pair sets it up consistently.
I did my darnedest to avoid using WAITXs in the write routines, but I can see the attraction. Certainly useful for handling the I/O latencies in the read routines.
Yes the waitx allows tuning on reads. As you may have guessed I'm somewhat less worried about every single clock being shaved in every single case, though that would be the way to go when directly coupled to the COG in order to try to keep latency at absolute minimum. In my case it's more functionality and reliability/consistency I'm after. With memory sharing, and real time accesses from video or audio COGs, the latency can get swamped anyway by other COGs transfers and the polling service time etc. So even if you shave it down a cycle or two, it'll only speed things up by fractions of a percent, it's not even noticeable. The transfer rate becomes the main factor then. In this case with SRAM in my driver it's sysclk/2 for reads, and sysclk/4 for writes (though with tweaking and for the smaller P2 frequencies we might be able to boost writes to sysclk/2 and not violate SRAM WE pulse width). I'm still figuring single writes and fills out, they are a little harder to get correct vs the reads.
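For a feel of what those divisors mean in throughput terms, here's a back-of-envelope sketch (assuming one byte moved per transfer cycle; the helper name is mine):

```python
# Back-of-envelope byte throughput for the divisors quoted above,
# assuming one byte moved per transfer cycle.
def mb_per_s(sysclk_hz, divisor):
    return sysclk_hz / divisor / 1e6   # bytes/second expressed in MB/s

# at sysclk = 250MHz: reads (sysclk/2) = 125 MB/s, writes (sysclk/4) = 62.5 MB/s
```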
@evanh said:
I'd use a preprogrammed PAL/CPLD chip as the external counter. Then it can be placed on the 8-bit databus as well. With this arrangement the entire address would be loaded into it a byte at a time. It does give you much lower access latency and avoids the refresh complications of PS(D)RAMs.
Could get creative with features like single byte sized address updates packed into the CPLD.
I've not read the whole thread, but this is the same idea that came to me after @aaaaaaaargh 's proposed SRAM.
With a small CPLD (a PAL is not enough) you can maintain an 8-bit-wide (multiplexed) data/address bus for up to 16MB (8x 2Mx8 chips) plus 2, maybe 3, more signals.
For both random and burst reads/writes the overhead will be 3 cycles to pass the 24-bit address. No worries about refresh or memory boundaries.
While the reads/writes (i.e. the data transfers) need to respect the SRAM's timings, the address and any commands can be transmitted significantly faster because the CPLD will latch the information for the SRAM.
With e.g. a 100MHz P2 clock (10ns period) you can transmit a command and 3 address bytes in 40ns (sysclock/1) and then exchange data at 10ns/byte (sysclock/1).
With e.g. a 200MHz P2 clock (5ns period) you can transmit a command and 3 address bytes in 20ns (sysclock/1) and then exchange data at 10ns/byte (sysclock/2).
With e.g. a 250MHz P2 clock (4ns period) you can transmit a command and 3 address bytes in 16ns (sysclock/1) and then exchange data at 8ns/byte (sysclock/2).
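Those three timing lines all follow from the same arithmetic. A quick check, as a hypothetical helper assuming a 4-byte preamble (command + 3 address bytes) at sysclock/1:

```python
# Verify the preamble/data timings above: 4 preamble bytes (command +
# 3 address bytes) sent at sysclock/1, data at the stated divisor.
def preamble_and_byte_ns(sysclk_mhz, data_div):
    period = 1000.0 / sysclk_mhz           # P2 clock period in ns
    return 4 * period, period * data_div   # (preamble ns, ns per data byte)

# 100MHz div 1 -> (40, 10); 200MHz div 2 -> (20, 10); 250MHz div 2 -> (16, 8)
```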
The commands could optionally be passed over the bus using a dedicated signal, thus reducing the transaction to 3 bytes + data. If the design is limited to 8MB (4x 2Mx8 chips) then the MSB address bit can become the switch between command and operation mode, again fixing the data preamble to 3 bytes.
The CPLD could also absorb/adjust some signal delay requirements should that be needed.
All that talk of fpga, pal, latches and cpld got me thinking that sort of thing might be a task for those old Propeller 1 chips lying around... probably not enough pins, though
@dMajo,
I think in practice, it's probably going to be a little bit slower than anticipated. If the P2 needs to generate and send a clock signal or strobe pulse etc to some external CPLD for latching purposes, turns out you can only generate a clock at sysclk/2 frequency from the P2 pins. Data can come out of the pins at sysclk/1 rates however using the streamer. Also you probably don't want the address and strobe/clock edge changing at the same time creating a race condition unless you've got a very well managed skew with the control signals vs address bus pins and know which one will arrive first into the latches or CPLD etc..
I do think there is probably scope for a multiplexed variant of SRAM in these drivers at some point though, to help save pins. How universal it could be is up for grabs. Needs to be some well considered/generic approach.
Or three or four generic approaches. There's nothing wrong with a Ferrari, and nothing wrong with a motorhome, but trying to mix the two is not going to end well either way. S.
@aaaaaaaargh said:
All that talk of fpga, pal, latches and cpld got me thinking that sort of thing might be a task for those old Propeller 1 chips lying around... probably not enough pins, though
not only the pins: you need 21 address, 8 data, 1 OE, 1 WE and 1 CS per chip on the RAM side, plus the additional P2 signals: an 8-bit bus and a few control lines (2/3) on the P2 side.
Also, by overclocking the P1 to 100MHz you have a minimum clock period of 10ns, which will grow because you need instructions to read from the P2 and drive the RAM.
But a Max10 10M02SCU169I7G or 10M02SCE144I7G will do it all, and potentially has pins and logic to offload the P2 for some other tasks.
@rogloh said:
@dMajo,
I think in practice, it's probably going to be a little bit slower than anticipated. If the P2 needs to generate and send a clock signal or strobe pulse etc to some external CPLD for latching purposes, turns out you can only generate a clock at sysclk/2 frequency from the P2 pins. Data can come out of the pins at sysclk/1 rates however using the streamer. Also you probably don't want the address and strobe/clock edge changing at the same time creating a race condition unless you've got a very well managed skew with the control signals vs address bus pins and know which one will arrive first into the latches or CPLD etc..
I do think there is probably scope for a multiplexed variant of SRAM in these drivers at some point though, to help save pins. How universal it could be is up for grabs. Needs to be some well considered/generic approach.
I think it all depends on the CPLD. E.g. the Max10 has an internal PLL that can take its clock from the P2 X2 pin, and IIRC guarantees its jitter within 15ps. I think you do not need any strobe output. Regarding the address latches, these being internal, I also don't see "race condition" issues that can't be solved (delayed) internally.
And now that I think a bit more about it, up to a certain burst length you could also use the Max10's internal RAM to buffer the SRAM transactions, because it can be used as a dual-port memory operated from different clock domains on the two sides. So the P2 can always operate at sysclock/1 regardless of its frequency.
If the P2 is running quicker (above the SRAM access time), the write operation will have some busy time at the end, while the read will need some busy (wait) time between address setup and data read. A signal can indicate that the transaction (SRAM) is busy.
If the P2 is running slower then there are no issues; the transaction will never be busy.
For long bursts, beyond the Max10's RAM capacity, what I wrote before (considering the synced PLLs) still applies.
The only drawback is that the SRAM solution requires a CPLD to not be pin-hungry, but the positive side is that this solution, with a small battery powering the SRAMs, can also become safe storage.
I imagine if cost and time are not a major consideration a CPLD + SRAM would be one way to reduce latency (to a small degree) if you also need to reduce the pin count. Thankfully the P2 still has a reasonable number of total IO pins. When you get down to 11 pins, HyperRAM is probably still the way to go for price and performance vs CPLD + SRAM, but it has higher latency. This thread is about video buffers from RAM so latency is not the most important thing. I do know it could be more important for emulators, or cache reading etc where only one COG is ever accessing the memory.
I looked again at 8-bit SDRAM. It actually requires 3 fewer pins for 16MB vs the 2MB SRAM solution if you stick with 8-bit data widths. So 29 pins for 16MB of SDRAM. Performance would be quite similar to the SRAM, and data would be clocked at sysclk/2. I think the P2 could do SDRAM writes quite easily, but getting reads working would depend on the P2 input timing happening around the output clock edge, with data being clocked into the P2 in the correct time window when the data is valid. For a P2 @ 250MHz it is valid 5.4ns after the last edge and 2.5ns after the next edge, making a window of about 5ns in which you need to clock it in.
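A rough pin tally is consistent with that "3 fewer pins" comparison; the exact control-line breakdown below is my assumption, not from the post:

```python
# Rough pin tallies behind the SDRAM vs SRAM pin-count comparison.
# The control-line breakdowns are assumptions on my part.
def sdram_x8_pins():
    # 16MB x8 SDRAM: 12 multiplexed row/col address + 2 bank + 8 data,
    # plus CLK, CKE, /CS, /RAS, /CAS, /WE, DQM
    return 12 + 2 + 8 + 7

def sram_2mb_pins():
    # 2MB async SRAM: 21 address + 8 data + /CS, /OE, /WE
    return 21 + 8 + 3

# 29 vs 32 pins, so the SDRAM comes out 3 pins ahead
```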
Because I already have the burst size controls supported in the driver from the HyperRAM and PSRAM implementations, refresh should not be an issue and could still remain hidden from the mailbox client - we can just put it into auto-refresh mode after each burst. I'm sort of tempted to attempt this at some point. It might be one of the cheapest solutions as this type of RAM should be a commodity. But if 4 PSRAM devices from China are even cheaper, that would be better performing, needing only 18 pins for effectively sysclk/1 MB/s transfers with 16 bits.
Rogloh, if you want to have a go at SDRAM I'm thinking of revving up an old P1V board that had a 54-pin SDRAM.
Alternatively current board can do 28 but not really 29 pins easily, I guess it would just halve its addressable size?
I need to mull it over more before I commit to it. But I am somewhat interested to see if it can be done. I know if I make up my own board with a chip on it, I may run into issues on the P2-EVAL with that P28-P31 issue. Perhaps I'll mod one of my P2-EVALs to try to use a 20MHz oscillator feeding into XI if that is known to solve it. It would make the EVAL board a bit ugly with an oscillator flapping in the breeze (or maybe I can dead bug it and glue it down and tap power from elsewhere).
I think I've got single and burst reads working now with the correct timing offsets for SRAM. Tested at 4MHz with my logic analyzer, but the code also has an adjustable delay to compensate for higher frequencies. I think to go much further with SRAM I now need to make up a board or fit some DIP SRAM onto the JonnyMac board with the P2-Edge and try to clock it faster.
I used a second COG to output a counting pattern of 4 repeated bytes before incrementing the data pins, and I then read a burst into my scratch RAM area of HUB for dumping, and the edges line up how I want with the address transitions. Data is effectively sampled before the address changes but I can tweak this further if needed.
Update: Just found a couple of SRAMs lying in my parts bin I can probably hack up for a slightly faster (but not full-speed) test, as they aren't 8ns parts:
1x 32 pin 0.6" DIP Samsung KM681000BLP-7L (128kB 5V CMOS but I think I've run it at 3.3V before)
2x 36 pin SOJ Cypress CY7C1049-12 (512kB 3.3V) This needs a PCB but can run up to 80MHz (160MHz P2).
Just realized this driver could probably eventually support reading EPROMs too (in a Read-Only mode) for any old school aficionados. But the access time will be a limiting factor if you clock it too fast. I'd need some wait states or slower clock options for that....LOL.
Just wired up this abomination with that DIP SRAM I had lying about! Whether it works or not is yet to be determined. Maybe I can run it at 10-20MHz or something, LOL. I think the access time for this old part is only 70 or 55ns, which is very slow, plus I'm undervolting it too. Thankfully the any-address-order and any-data-order freedom helped a bunch with the wiring.
Comments
Looks like your cepin is setup as 3 bits too high above pin 8 and immediately clobbering the prior outh instruction.
Instead of this:
maybe try this:
EDIT: On second thoughts, scrap that. You seem to have your address bus shifted up some bits. I have mine starting at bit zero. However the first mov outa, sr_addr doesn't keep pin 8 set to 1 which I think is the issue.
Yep, that was it. And I'd only just moved it into that position in the source too. It should have rung an alarm much earlier. These things happen. PS: I've been trying to optimise again.
PPS: Sigh, now I've got an ugly NOP sitting there padding the XINIT.
PPPS: I do have the smartpin sequencing down to four instructions now. That makes me feel better.
If you want a 'latch with party tricks', then a CPLD would do (I wrote some with 22V10s and XC9500s, ages ago) but that's overkill, really. A 74xx373 8-bit latch (or, more likely, two or three of them) could each latch off their share of the address from the data bus.
The tradeoff (it's always something!) is that you need more bus cycles to load each address latch, and another pin signal for each latch, but the whole thing could come in at 13 pins.
(DB0-7, ALE's 0,1,2, /[ram]WR, /[ram]RD. /CS can be grounded if you're only using one SRAM chip).
In the application I currently have in mind (I've got several more for the P2 - Those smart-pins!) my external RAM access is going to be very sequential - Write a lot of data while "offline" (don't care about speed), then reset the address counter to zero, and start reading "online" (fast enough to feed the algorithm(s).) Random access doesn't really come into it.
Still, I imagine for most it will, so lots of ALE signals to pull address bits off the data bus are not a bad idea at all!
Have fun, S.
ETA a bit more: Since the P2 does pretty much everything in 32 bits, and at the moment we're assuming an eight-bit-wide SRAM, using a 16V8 as a six-bit latch with a two-bit counter on the low bits would cheerfully save quite a few bus cycles when reading or writing long values. Latches with party tricks...
LOL, "Premature optimization is the root of all evil", I was once told by a colleague.
A problem with ALE type latches is that once you wrap beyond the end of the 8 bit address group you've wired in directly (for example), then you need to update the latch again (or pulse a counter perhaps). This means the burst transfer has to stop right at this point and do something else. It's okay if you've always coded for this specific memory page size/architecture, but to keep it generic you have to compute where to stop the transfer and pulse a clock or load the next latch. It wouldn't (easily) suit a general purpose software driver designed to stream arbitrarily large bursts at any address but only a specifically coded one for that HW implementation.
Indeed.
Or you need latches with better party tricks - Imagine instead each latch is a loadable counter (74xx163, four bits a pop?*) and their 'count' pin is hooked up to to your 'burst read', so you start by loading up the start address and the "latches" and the address count up with every read/write pulse.
Understood, we're getting tricky here, saving pins by 'being clever in external hardware' and there are other tradeoffs as well. Seems to me, though, that a driver could accept 'How big is my page?' as an argument, and use that to 'stop and reload address'. Dunno myself - I'm not a driver writer.
Have fun, S.
*A 74F283 might be better, because otherwise the carry out from each counter could lollygag their way up for some time before arriving at the top bits
Read code is complete for full feature integration into my driver (untested, delay code might need adjustment). Gonna work on the write code then test with simulated SRAM timing until some board is available.
This code does not use special pin control optimizations but allows control pins to be put anywhere and the data bus just needs to be on any 8 pin group boundary. It also allows multiple banks with different CS pins. The only constraint is the address bus group needs to be located with it's base pin (A0) on either P0 (port A) or P32 (port B ), this makes the address setup/wrapping/masking simple and doesn't require extra shifting.
I've optimised the routine for single byte write down to six instructions. Just three for the smartpin.
The two WXPINs sort of provide an XINIT effect around the WYPIN.
The details are a little more subtle:
In the case of single reads/writes it makes sense to pull them out as special cases and and byte bang, especially writes. I'll probably optimize my own code later with this in mind. No need to use the streamer for single writes/fills, only bursts from HUB. Single byte reads are probably the same. Word and Long reads might not benefit so much there due to latency, TBD.
The problem with that is it adds extra handling in all the other routines that still use the smartpin. And it needs a WRPIN in the single byte routine too.
EDIT: Oh, are you looking at the SETBYTE replacing the XINIT? I thought you meant bit-bashing the control/clock pins.
Yeah using setbyte for writes/fills instead of the streamer.
E.g, something like this (for words in this case)
You've got something bazaar going on with that WYPIN c, clkpin and the WAITX. I don't see any support around it for starting in phase with the data. It must be a large period. If so then you're fluking the period timer reset without explicitly doing so.
Eg: Here's my latest linear write using the dynamic period changes to replace DIRL/H. It still requires three WXPINs for managing phase timing.
I'm yet to test and it might need some alterations... the general approach worked for HyperRAM reg writes though IIRC.
Update: Just looked at some of my HyperRam write setup code... I do see some extra instructions prior to reset the clock smart pin and setup its clock rate and we would still need something to do that sort of work too, so the earlier write snippet sample is certainly incomplete...it was mainly posted above to show the idea of not using the streamer for simple writes/fills, only for bursts.
Write burst timing seems to be good. I am copying a 10 byte pattern in HUB RAM that is $00ff00ff etc, to address 0 in a simulated SRAM with my driver. Only data bit 0 is shown and address bits 0-3.
Okay, cool, the DIRL/H pair sets it up consistently.
I did my darnedest to avoid using WAITXs in the write routines, but I can see the attraction. Certainly useful for handling the I/O latencies in the read routines.
Yes the waitx allows tuning on reads. As you may have guessed I'm somewhat less worried about every single clock being shaved in every single case, though that would be the way to go when directly coupled to the COG in order to try to keep latency at absolute minimum. In my case it's more functionality and reliability/consistency I'm after. With memory sharing, and real time accesses from video or audio COGs, the latency can get swamped anyway by other COGs transfers and the polling service time etc. So even if you shave it down a cycle or two, it'll only speed things up by fractions of a percent, it's not even noticeable. The transfer rate becomes the main factor then. In this case with SRAM in my driver it's sysclk/2 for reads, and sysclk/4 for writes (though with tweaking and for the smaller P2 frequencies we might be able to boost writes to sysclk/2 and not violate SRAM WE pulse width). I'm still figuring single writes and fills out, they are a little harder to get correct vs the reads.
I've not read all the thread but, this is the same idea that came to me after @aaaaaaaargh 's proposed SRAM.
With a small CPLD (a PAL is not enough) you can maintain an 8-bit-wide (multiplexed) data/address bus for up to 16MB (8x 2Mx8 chips) plus 2, maybe 3, more signals.
For both random and burst reads/writes the overhead will be 3 cycles to pass the 24-bit address. No worries with refresh or memory boundaries.
While the reads/writes (i.e. data transfers) need to respect the SRAM's timings, the address and any commands can be transmitted significantly faster because the CPLD latches the information for the SRAM.
With eg. 100MHz P2 clock (10ns period) you can eg transmit command and 3 bytes address in 40ns (sysclock/1) and then exchange data at 10ns/byte (sysclock/1).
With eg. 200MHz P2 clock (5ns period) you can eg transmit command and 3 bytes address in 20ns (sysclock/1) and then exchange data at 10ns/byte (sysclock/2).
With eg. 250MHz P2 clock (4ns period) you can eg transmit command and 3 bytes address in 16ns (sysclock/1) and then exchange data at 8ns/byte (sysclock/2).
Commands could also be passed over the bus using a dedicated signal, thus reducing the transaction to 3 bytes + data. If the design is limited to 8MB (4x 2Mx8 chips) then the MSB address bit can become the switch between command and operation mode, again fixing the data preamble to 3 bytes.
The CPLD could also compensate for some signal delay requirements should it be needed.
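The timing figures quoted a few lines up can be reproduced with a quick calculation (the function names are mine; the sysclk/1 vs sysclk/2 data dividers are taken from the examples above):

```python
# Numeric check of the address/command preamble timings quoted above:
# command + 3 address bytes clocked out at sysclk/1, then data at
# sysclk/1 or sysclk/2 depending on the P2 frequency.

def preamble_ns(p2_mhz: float, n_bytes: int = 4) -> float:
    """Time to clock out command + 3 address bytes at sysclk/1."""
    period_ns = 1000.0 / p2_mhz
    return n_bytes * period_ns

def data_ns_per_byte(p2_mhz: float, divider: int) -> float:
    """Nanoseconds per data byte at the given sysclk divider."""
    return divider * 1000.0 / p2_mhz

for mhz, div in [(100, 1), (200, 2), (250, 2)]:
    print(f"{mhz} MHz: preamble {preamble_ns(mhz):.0f} ns, "
          f"data {data_ns_per_byte(mhz, div):.0f} ns/byte")
```

This reproduces the 40/20/16 ns preambles and 10/10/8 ns per byte data rates stated above.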
All that talk of FPGAs, PALs, latches and CPLDs got me thinking that sort of thing might be a task for those old Propeller 1 chips lying around... probably not enough pins, though
@dMajo,
I think in practice it's probably going to be a little bit slower than anticipated. If the P2 needs to generate and send a clock signal or strobe pulse to an external CPLD for latching purposes, it turns out you can only generate a clock at sysclk/2 from the P2 pins. Data can come out of the pins at sysclk/1 rates, however, using the streamer. Also, you probably don't want the address and the strobe/clock edge changing at the same time, creating a race condition, unless you've got very well-managed skew between the control signals and the address bus pins and know which one will arrive first at the latches or CPLD.
I do think there is probably scope for a multiplexed variant of SRAM in these drivers at some point though, to help save pins. How universal it could be is up for grabs. Needs to be some well considered/generic approach.
Or three or four generic approaches. There's nothing wrong with a Ferrari, and nothing wrong with a motorhome, but trying to mix the two is not going to end well either way. S.
Not only the pins: you need 21 address, 8 data, 1 OE, 1 WE and 1 CS per chip on the RAM side, plus the additional P2 signals: an 8-bit bus and a few (2-3) control lines on the P2 side.
Also, by overclocking the P1 to 100MHz you have a minimum clock period of 10ns, which will grow because you need instructions to read from the P2 and drive the RAM.
But a MAX 10 (10M02SCU169I7G or 10M02SCE144I7G) would do it all, and potentially has pins and logic left over to offload the P2 for some other tasks.
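To put numbers on the direct-connection pin tally above, assuming one 2Mx8 SRAM on the RAM side and the suggested 8-bit bus plus 2-3 control lines on the P2 side (the control-line count here is my assumption):

```python
# Pin tally for a direct (un-multiplexed) 2Mx8 SRAM connection,
# per the figures in the post above. The P2-side control count of 3
# is an assumption; the post says "a few (2/3)".

ram_side = {"address": 21, "data": 8, "OE": 1, "WE": 1, "CS": 1}
p2_side = {"data_bus": 8, "control": 3}

print("RAM side:", sum(ram_side.values()), "pins")
print("P2 side: ", sum(p2_side.values()), "pins")
```

The contrast (32 pins on the RAM side vs around 11 on the multiplexed P2 side) is the whole argument for putting a latch or CPLD in between.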
I think it all depends on the CPLD. E.g. the MAX 10 has an internal PLL that can take its clock from the P2's X2 pin, and IIRC guarantees its jitter within 15ps. I think you do not need any strobe output. Regarding the address latches, being internal, I also don't see any "race condition" issues that can't be solved (delayed) internally.
And now that I think a bit more about it, up to a certain burst length you could also use the MAX 10's internal RAM to buffer the SRAM transactions, because it can be operated as a dual-port RAM with different clock domains on the two sides. So the P2 can always operate at sysclk/1 regardless of its frequency.
If the P2 is running faster (above the SRAM access time), writes will have some busy time at the end, while reads will need some busy (free) time between address setup and data read. A signal can indicate the transaction (SRAM) is busy.
If the P2 is running slower then there are no issues; the transaction will never be busy.
For long bursts, beyond the MAX 10's RAM capacity, what I wrote before (considering the synced PLLs) still applies.
The only drawback is that the SRAM solution requires a CPLD to avoid being pin-hungry, but on the positive side, with a small battery powering the SRAMs, this solution can also become safe storage.
I imagine if cost and time are not a major consideration a CPLD + SRAM would be one way to reduce latency (to a small degree) if you also need to reduce the pin count. Thankfully the P2 still has a reasonable number of total IO pins. When you get down to 11 pins, HyperRAM is probably still the way to go for price and performance vs CPLD + SRAM, but it has higher latency. This thread is about video buffers from RAM so latency is not the most important thing. I do know it could be more important for emulators, or cache reading etc where only one COG is ever accessing the memory.
I looked again at 8-bit SDRAM. It actually requires 3 fewer pins for 16MB vs the 2MB SRAM solution if you stick with 8-bit data widths. So 29 pins for 16MB of SDRAM. Performance would be quite similar to the SRAM, with data clocked at sysclk/2. I think the P2 could do SDRAM writes quite easily, but getting reads working would depend on the P2 input timing around the output clock edge, with data being clocked into the P2 in the correct time window while it is valid. For a P2@250MHz the data is valid from 5.4ns after the last edge until 2.5ns after the next edge, making a window of about 5ns in which you need to clock it in.
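My reading of those figures: data becomes valid 5.4 ns after one SDRAM clock edge and stays valid until 2.5 ns after the next edge, which is 8 ns later with the SDRAM clock at sysclk/2 on a 250 MHz P2. A quick check that this reproduces the ~5 ns window:

```python
# Reconstructing the ~5 ns valid-data window quoted above for a
# 250 MHz P2 clocking the SDRAM at sysclk/2 (8 ns clock period).
# The 5.4 ns and 2.5 ns figures are taken from the post.

P2_MHZ = 250
sdram_clk_period_ns = 2 * 1000 / P2_MHZ    # sysclk/2 -> 8 ns
t_valid_after_edge = 5.4                   # data valid this long after an edge
t_hold_after_next = 2.5                    # still valid this long after next edge

window_ns = sdram_clk_period_ns + t_hold_after_next - t_valid_after_edge
print(f"valid window: {window_ns:.1f} ns")
```

That 5.1 ns window is what the P2's input sampling has to land inside, which is why the input-timing compensation matters so much here.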
Because I already have the burst size controls supported in the driver from the HyperRAM and PSRAM implementations, refresh should not be an issue and could still remain hidden from the mailbox client - we can just put it into auto-refresh mode after each burst. I'm sort of tempted to attempt this at some point. It might be one of the cheapest solutions as this type of RAM should be a commodity. But if 4 PSRAM devices from China are even cheaper, that would perform better, needing only 18 pins for effectively sysclk/1 MB/s transfer rates with 16 bits.
Here's a sample SDRAM.
https://au.mouser.com/datasheet/2/198/42-45S81600F-16800F-258526.pdf
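To put rough numbers on the "auto-refresh after each burst" idea: a common requirement for SDRAMs in this class is 4096 auto-refresh cycles every 64 ms (these constants are assumptions; check the linked datasheet for the actual part). That sets how often bursts have to be broken up:

```python
# Refresh-budget sketch for issuing auto-refresh between bursts.
# Assumed figures: 4096 refresh cycles per 64 ms (verify on datasheet).

REFRESH_ROWS = 4096
REFRESH_PERIOD_MS = 64.0

avg_interval_us = REFRESH_PERIOD_MS * 1000 / REFRESH_ROWS
print(f"one auto-refresh needed every {avg_interval_us} us on average")

# At ~125 MB/s (sysclk/2 data on a 250 MHz P2), the burst that fits
# between refreshes is roughly:
bytes_per_interval = 125e6 * avg_interval_us * 1e-6
print(f"~{bytes_per_interval:.0f} bytes per refresh interval")
```

So as long as the driver's existing burst-size limit keeps bursts well under a couple of KB, a refresh slipped in after each burst should comfortably meet the budget.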
Rogloh, if you want to have a go at SDRAM, I'm thinking of revving up an old P1V board that had a 54-pin SDRAM.
Alternatively, the current board can do 28 but not really 29 pins easily; I guess it would just halve its addressable size?
I need to mull it over more before I commit to it. But I am somewhat interested to see if it can be done. I know if I make up my own board with a chip on it, I may run into issues on the P2-EVAL with that P28-P31 issue. Perhaps I'll mod one of my P2-EVALs to try to use a 20MHz oscillator feeding into XI if that is known to solve it. It would make the EVAL board a bit ugly with an oscillator flapping in the breeze (or maybe I can dead bug it and glue it down and tap power from elsewhere).
I think I've got single and burst reads working now with the correct timing offsets for SRAM. Tested at 4MHz with my logic analyzer, but the code also has an adjustable delay to compensate for higher frequencies. I think to go much further with SRAM I now need to make up a board, or fit some DIP SRAM onto the JonnyMac board with the P2-Edge, and try to clock it faster.
I used a second COG to output a counting pattern of 4 repeated bytes before incrementing the data pins, and I then read a burst into my scratch RAM area of HUB for dumping. The edges line up how I want with the address transitions. Data is effectively sampled before the address changes, but I can tweak this further if needed.
Update: Just found a couple of SRAMs lying in my parts bin I can probably hack up for a slightly faster (but not full speed) test, as they aren't 8ns parts:
1x 32-pin 0.6" DIP Samsung KM681000BLP-7L (128kB, 5V CMOS, but I think I've run it at 3.3V before)
2x 36-pin SOJ Cypress CY7C1049-12 (512kB, 3.3V). This needs a PCB but can run up to 80MHz (160MHz P2).
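As a rough rule of thumb for those parts, with data moving at sysclk/2 the P2 clock is limited to roughly twice the inverse of the SRAM access time. This ignores setup/hold margins and I/O delays, so treat the numbers as optimistic:

```python
# Back-of-envelope maximum P2 sysclk for a given SRAM access time,
# assuming data clocked at sysclk/2 and ignoring I/O delay margins.

def max_p2_mhz(t_access_ns: float) -> float:
    return 2 * 1000 / t_access_ns

print(f"12 ns part (CY7C1049-12):     ~{max_p2_mhz(12):.0f} MHz P2")
print(f"55 ns part (KM681000BLP-7L):  ~{max_p2_mhz(55):.0f} MHz P2")
```

The 12 ns part works out to roughly 167 MHz, in the same ballpark as the 160 MHz quoted above once margins are allowed for; the old 55 ns DIP part is limited to a few tens of MHz at best.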
Just realized this driver could probably eventually support reading EPROMs too (in a read-only mode) for any old-school aficionados. But the access time will be a limiting factor if you clock it too fast. I'd need some wait states or slower clock options for that... LOL.
Just wired up this abomination with that DIP SRAM I had laying about! Whether it works or not is yet to be determined. Maybe I can run it at 10-20MHz or something, LOL. I think the access time for this old part is only 70 or 55ns which is very slow, plus I'm undervolting it too. Thankfully the any-order address and data pin mapping helped a bunch with the wiring.
Heh, that'll have quite the spring in its step.