So it's been many months since I last looked at any of this and I am not sure if anyone is still playing much with the P1V, but today I finally got the chance to think more about how to get the 8MB SDRAM on board the BeMicro MAX10 hooked up into a single COG for read/write access at normal hub speed. I rediscovered a sample SDRAM Verilog controller I had put together last year for this work and even managed to get it to compile into a P1V codebase on Quartus II today. Actually I'm feeling a lot like a total FPGA noob now since I haven't touched those FPGA tools for months and seem to have forgotten heaps unfortunately.
At the moment I am a bit too fearful to hook this controller logic directly into the actual SDRAM chip on the board, in case I damage it by driving data out at the wrong times, so I'm hoping to temporarily bring the SDRAM control/data pins out to some I/O pins on the board and scope them to check that the logic looks correct. There are no high end tools at home, just a lowly Rigol scope and a Saleae logic probe, so I'm going to have to clock this design down to probably about 10MHz instead of 80MHz, see just what signals can be observed simultaneously, and then match them up with the intended sequence. Then once I'm a bit more game with the overall design I'll set them back to connect to the real SDRAM chip pins, give it a try and see if it smokes. This is not the way you would normally want to do things, but at the same time I don't know how to set up proper simulations etc.
As far as the SDRAM controller design goes it seems to only use about 88 LE's so far and I am happy enough with that. It may bump up a bit if I find I need to register some inputs or outputs but I don't imagine that would add more than 2x this number. Even then I'm confident I can fit it into my intended design with the SRAM and graphics controller I'd already worked on before and still leave enough room for a few other nice things like the HW multiply and some previous address mode enhancements and auto incrementing pointers.
There will be two 32 bit random read or write accesses available per hub cycle, one from a single fixed COG, the other from a basic refresh counter and ultimately a future type of DMA controller. I'd really like this controller to be able to stream data in the background from SDRAM into graphics SRAM in the spare hub slot left over for that purpose, perhaps for supporting transparent sprites and block copies etc. That will free up the main COG to just setup a transfer list in memory and the HW will go do the transfers automatically after that.
Initially this is intended for the BeMicro MAX 10 but I also have a DE0-Nano that also has some SDRAM on board which I (or anyone else) might try to port to if I manage to get something working. Time will tell. I hope I don't get too discouraged if I fail to get the tools doing what I want. That's probably the frustrating part given I have hardly used them much and simply don't know the process to optimize timing. I know precisely what I want and pretty much where it would tap into the P1V but now have to figure out how to get these tools to make it so.
With any luck I might just get something going but as I'm hardly a FPGA guru or high speed HW guy this will be a huge challenge for me with the 80MHz timing and SDRAM clocking. Though in the end I did get my 2MB SRAM working at the full hub speed with 8 transfers per hub window so there is some hope yet.
This is really cool!
BTW what is the width of the sdram 32bits?
Perhaps you could map some of the sdram to a single cog as extended cog ram. This would give full cog speed code and/or data space (not register space tho).
The SDRAM width is only 16 bits, but you can do double read or write bursts to get the 32 bits over two successive clock cycles. Initially I just plan to run it at 80MHz locked to the hub and deliver a single read or write per hub cycle for a single COG. With its pipelining, from a COG perspective accessing SDRAM may take 8 clocks instead of the normal 7 we get with internal hub access sweet spots, but I think we can still sustain 5 MIPS (assuming that prior auto increment feature I did is also present).
At 80MHz it takes about 6 clocks to activate an SDRAM row, read two 16 bit values, and precharge it back. In the 16 clock cycles per hub window, without bank overlap, that gives us two accesses and leaves 4 more transfer cycle opportunities free. Perhaps it would be nice to play around with this to try to sustain hub exec at full rate by reading larger bursts into some sort of hub exec FIFO with these extra cycles, though the random jumps between rows will cause stalls and refresh also complicates things. It might be interesting to investigate that later, however, because there is probably some extra scope to use overlapping dedicated bank access for hub exec, and this might get the 4x32 bits out of it in each hub cycle to yield 20 MIPS on linear code while still leaving some reasonably useful amount of general purpose external RAM for other purposes such as graphics data. Have to think more about it...
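As a rough check of those figures, the arithmetic can be sketched in Python. The constants come from the post above (80MHz clock, 16 clocks per hub window, about 6 clocks per activate/read/precharge access); the function names are made up for illustration only.

```python
# Back-of-envelope throughput check for the hub-locked SDRAM scheme.
# All constants are taken from the post; function names are illustrative only.

SYS_CLK_HZ = 80_000_000      # P1V and SDRAM clock
CLOCKS_PER_HUB_WINDOW = 16   # one full hub rotation
CLOCKS_PER_ACCESS = 6        # activate + 2x16-bit read + precharge (approximate)


def accesses_per_window(clocks_per_access=CLOCKS_PER_ACCESS):
    """Return (whole accesses that fit without bank overlap, clocks left over)."""
    return divmod(CLOCKS_PER_HUB_WINDOW, clocks_per_access)


def longs_per_second(longs_per_window):
    """Sustained 32-bit transfers per second when locked to the hub window."""
    return SYS_CLK_HZ // CLOCKS_PER_HUB_WINDOW * longs_per_window


if __name__ == "__main__":
    fit, spare = accesses_per_window()
    print(f"{fit} accesses per window, {spare} clocks spare")
    print(f"1 long per window:  {longs_per_second(1) / 1e6:.0f} M longs/s")
    print(f"4 longs per window: {longs_per_second(4) / 1e6:.0f} M longs/s")
```

One long per window gives the 5 million accesses per second quoted for a single COG, and four longs per window gives the 20 million figure for linear hub exec code.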
Aha, all those extra clocks for sdram access. Forgot about them
I might look at building an SRAM board x32 bits. Could access this full speed per cog instruction cycle so I could map it all to extended cog ram. This could then work as a data logger or snooper. I have been snooping USB but it's a problem to store a decent trace length with P1. Hub imposes some limitations in timing. I could add a special instruction to save and increment. This could make a nice logic monitor cheaply.
I published my SRAM designs in the thread below so you are free to use them too. I also planned to make something like this for the DE-0 nano which has easier connections via the IDC header instead of the Samtec thing.
At 80MHz it takes about 6 clocks to activate an SDRAM row, read two 16 bit values, and precharge it back. In the 16 clock cycles per hub window, without bank overlap, that gives us two accesses and leaves 4 more transfer cycle opportunities free. Perhaps it would be nice to play around with this to try to sustain hub exec at full rate by reading larger bursts into some sort of hub exec FIFO with these extra cycles, though the random jumps between rows will cause stalls and refresh also complicates things.
If there are spare cycles, you could make a tiny fifo of that size, that auto-fills and if the next address is in the fifo, you can give that faster.
Gives higher in-line speed with no reduction of random access, but some small jitter.
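A toy model may make the idea concrete. This is only a sketch under assumptions: the class name, FIFO depth and cycle costs are hypothetical, and the refill is treated as free because it is assumed to happen in otherwise spare cycles.

```python
# Toy model of a small read-ahead FIFO in front of a slow external memory.
# Hypothetical sketch: the class, depth and cycle costs are illustrative and
# are not taken from any real P1V code. Refills are assumed to cost nothing
# because they would use otherwise spare cycles.


class PrefetchFifo:
    def __init__(self, memory, depth=4, slow_cost=16, fast_cost=4):
        self.memory = memory        # backing store, indexed by long address
        self.depth = depth          # number of longs fetched ahead
        self.slow_cost = slow_cost  # clocks for a normal random access
        self.fast_cost = fast_cost  # clocks when the long is already buffered
        self.buffer = {}            # address -> prefetched data

    def _fill(self, start):
        """Discard old contents and fetch the next `depth` sequential longs."""
        end = min(start + self.depth, len(self.memory))
        self.buffer = {addr: self.memory[addr] for addr in range(start, end)}

    def read(self, addr):
        """Return (data, clocks). A hit is cheap; a miss costs a full access."""
        if addr in self.buffer:
            data, cost = self.buffer.pop(addr), self.fast_cost
            if not self.buffer:
                self._fill(addr + 1)
        else:
            data, cost = self.memory[addr], self.slow_cost
            self._fill(addr + 1)
        return data, cost


if __name__ == "__main__":
    mem = list(range(100, 164))
    fifo = PrefetchFifo(mem)
    linear = sum(fifo.read(a)[1] for a in range(16))
    print("16 sequential reads:", linear, "clocks")
    print("jump to address 40:", fifo.read(40))
```

In this model only the first read of a linear run pays the full cost, while any jump outside the buffer falls back to the normal random access time, which is where the jitter comes from.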
This type of streaming / in-line reading is also suited to QuadSPI, and DDR QuadSPI etc.
Ideally, a CPU would have a short skip opcode ( was there a Natsemi CPU that did that ? )
On P1, the conditional flags could be used more aggressively for in-line results.
I see prices are up for Spansion HyperFLASH
1+ $3.42 at Avnet, 100MHz 128Mbit, only in 24-ball BGA
Cluso, so you really have me intrigued about the possibility of allocating another SDRAM bank to the hub exec function on P1V.
I looked into this with the SDRAM and I am starting to think perhaps it might be possible for all the following to coexist in the overall hub cycle. If this was achievable for MAX10 P1V it would be awesome.
3 SDRAM accesses per 16 clock cycle hub loop :
- 4x32 bit read accesses on a subset of SDRAM banks for preloading a FIFO with PASM CODE instructions at the rate of 20 Million longs per second. This enables an instruction prefetcher - however handling stalls and periodic refreshes of this bank would need to be developed further.
- 1x32 bit read or write data access from the main COG for accessing extended hub DATA in the remaining banks.
- Finally an additional 1x32 bit read or 16 bit DMA write access to the same non hub exec memory banks and this can also refresh these banks either when the DMA is idle, and/or have the ability to interrupt DMA in progress with the refresh. This background refresh allows the COG's SDRAM DATA access to remain deterministic, hitting it every hub cycle and keeping refresh hidden.
It's a real jigsaw puzzle to make it fit with the various SDRAM timing restrictions but for CAS latency of 2, read bursts of 2, write bursts of 1 and using some auto-precharge this sequence below may possibly work...depending on auto pre-charge constraints. 3 accesses are made, one using the hubExec bank, and two from other banks, the second write from the DMA has to be limited to 16 bits however. This may not be a problem as I am intending to mainly use it for reads of graphics data to then be written into my graphics SRAM or streaming audio out to an i2S interface, and if it is only used for the refresh purposes instead this wouldn't matter anyway.
I'm not going to think about it further at the moment, but it is in the back of my mind to consider this later if I manage to get the SDRAM going....
Clk  Command          DQ bus
0    ActivateBank2    HubExecData
1    NOP              HubExecData
2    Read2/Write2L    Write2LData - if writing
3    NOP/Write2H      Write2HData - if writing
4    ReadHubExec      Read2LData - if reading
5    NOP              Read2HData - if reading
6    ReadHubExec      HubExecData
7    ActivateBank3    HubExecData
8    NOP              HubExecData
9    Read3/NOP        HubExecData
10   ActivateHubBank
11   NOP/Write3       Read3LData or Write3LData if writing
12   ReadHubExec      Read3HData - if reading
13   NOP
14   ReadHubExec      HubExecData
15   NOP              HubExecData
One thing about the P123 A7/A9's is their DIL GPIO headers line up with the headers of the DE0-Nano.
OzPropDev has a neat P2 80 MHz logic analyzer with 1 pixel per 80MHz clock on its svga (xga?) screen. You could remove a couple of power pins and carry the rest of the data through.
SRAM is lower density and more expensive than SDRAM but doesn't have all the timing constraints.
I could use 4 @ 512Kx8 (2MB) 10ns SRAMs for hub and/or extended cog ram. Of course it does require 19 address pins, 32 data pins, plus 4 sets of CE, OE, WE (or get them together), for a maximum of 63 pins.
Alternately, I could use 1 @ 512Kx8 (512KB). This only requires 19 address pins, 8 data pins, and CE, OE and WE, giving 30 pins. It is possible to share I2C for booting. It is also possible to share pins with an SD card, although for what I am talking about here, I wouldn't do that because I want full SRAM access.
So, with a P1V (assume 100MHz as it is easier to work thru timing) it would be possible to use that byte-wide SRAM and read/write one byte per system clock, making 4 bytes per instruction. This would work nicely for just one cog, but would require a hub style slot arrangement to share amongst a number of cogs.
Hub ram is 1:16, so it's 2 clocks per cog slot. Therefore you would require 2 rotations to read/write a long, but only 1 slot to read/write a byte or word.
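The pin counts and slot arithmetic for the two SRAM options can be tallied with a short Python sketch. The figures are the ones given above; the function names are illustrative only.

```python
# Pin-count and slot arithmetic for the two SRAM options described above.
# Figures come from the post; function names are illustrative only.


def sram_pins(address_bits, data_bits, chips, shared_controls):
    """Address + data + CE/OE/WE control pins."""
    control = 3 if shared_controls else 3 * chips
    return address_bits + data_bits + control


def rotations_per_long(bytes_per_clock=1, clocks_per_slot=2):
    """Hub rotations needed to move a 32-bit long through one cog's slot."""
    bytes_per_slot = bytes_per_clock * clocks_per_slot
    return -(-4 // bytes_per_slot)  # ceiling division


if __name__ == "__main__":
    print("4 x 512Kx8, separate controls:", sram_pins(19, 32, 4, False), "pins")
    print("4 x 512Kx8, shared controls:  ", sram_pins(19, 32, 4, True), "pins")
    print("1 x 512Kx8:                   ", sram_pins(19, 8, 1, True), "pins")
    print("rotations per long, byte-wide:", rotations_per_long())
```

The 63 pin maximum corresponds to separate CE/OE/WE lines per chip; tying them together drops the 32 bit option to 54 pins.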
Certainly a couple of nice possibilities here
FWIW I need to run the P1V at 96MHz, else I have to run at 48MHz. I haven't even looked to see if this is possible without changing the xtal. Anyone know the answer???
Hi Lachlan, actually I don't quite get it. Were you suggesting some memory board could be made common to work with both platform types? If so, yeah I think the power and ground pins would be different to the DE-0 nano, unless the A7/A9 boards pinout changed since the photo of the rev A board shown in the sticky.
Also I am finding it best if the memory bus has its own set of pins that leave the rest of the P1V (or P2) pins free for other purposes. That's where the BeMicro MAX 10 really shines with its very low cost and high pin count. However that obviously means you use your own custom FPGA design and not use the final commercial P2 device (if/when it ships).
On the BeMicro MAX10 board for example I plan to have 32 bit Port A with some internal/external peripherals and leave 32 bit Port B totally free for general purpose expansion use, and run my separate SRAM and SDRAM buses into the hub as well. That's all likely possible. The only real limitation is the 3 COGs, but the built in video and one day hopefully audio does help alleviate that. A larger MAX10M16 device would probably have been ideal and yielded those 5 extra COGs. It was such a pity they didn't populate that part instead, even if it would have pushed the price up accordingly. If you were to spin your own board instead of using the off the shelf MAX10 development board you could always use such a chip I guess. It's a nice match for P1V based designs.
I looked at the BeMicroA9 board and this cuts down on the free pins somewhat but could still be used for LVDS displays. My little SRAM expansion board probably fits that too as it has the same number and placement of GPIO pins on that 80 way connector.
On the DE0-nano, the two groups of IDC header pins lend themselves to supporting a nice fast 32 bit SRAM interface while still leaving the lower pins available for some GPIO, but there is not a lot of I/O left over. For some applications there would be enough.
....
- 4x32 bit read accesses on a subset of SDRAM banks for preloading a FIFO with PASM CODE instructions at the rate of 20 Million longs per second. This enables an instruction prefetcher - however handling stalls and periodic refreshes of this bank would need to be developed further.
- 1x32 bit read or write data access from the main COG for accessing extended hub DATA in the remaining banks.
- Finally an additional 1x32 bit read or 16 bit DMA write access to the same non hub exec memory banks and this can also refresh these banks either when the DMA is idle, and/or have the ability to interrupt DMA in progress with the refresh. .
That would be impressive if it can fit.
I thought SDRAMS needed more preamble stuff, and then could stream ?
What's the part number of the SDRAM used ?
Yeah if realizable it would be a thing of beauty with both 20MIPs peak hub exec and a much larger hub DATA RAM space too albeit for a single (main) COG.
The BeMicro MAX 10 schematic shows IS42S16400J-7TL is the SDRAM used. This device is 8MB total using 4 x 1MBx16 bit wide banks. The DE0-nano has a larger 4 bank 32MB part if memory serves me but also 16 bits wide.
For SDRAM reads you basically need to :
1 activate a row of some bank
2 wait tRCD before the row is ready to be read
3 issue the read command (with optional auto-precharge)
4 wait the (tCL) CAS latency before starting to read the data off the bus, usually 2 or 3 clocks
5 read the data burst and optionally issue a precharge command during this time to prepare to close the row of the bank
6 wait some time for the precharge operation to complete (tRP) until you can re-activate some other row in this bank, ensuring you do not do more than 1 complete memory access per bank in less than tRC cycle time.
There are also some other rules about how precharge is timed, aborting bursts, and back to back activate commands etc.
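The wait steps above are specified in nanoseconds but have to be met in whole clocks, which can be sketched as follows. The nanosecond values in this sketch are placeholders assumed for illustration, not figures taken from the thread, and should be checked against the datasheet for the actual part and speed grade.

```python
# Convert SDRAM timing parameters from nanoseconds to whole clocks.
# The nanosecond values below are ASSUMED placeholders for illustration;
# check them against the real datasheet before relying on them.
import math

ASSUMED_TIMING_NS = {
    "tRCD": 15.0,  # activate to read/write delay
    "tRP": 15.0,   # precharge to next activate
    "tRC": 63.0,   # activate to activate on the same bank
}


def ns_to_clocks(t_ns, clock_hz):
    """Smallest whole number of clocks that covers t_ns."""
    period_ns = 1e9 / clock_hz
    # The small epsilon stops float rounding from adding a clock on exact multiples.
    return math.ceil(t_ns / period_ns - 1e-9)


def timing_in_clocks(clock_hz, timing_ns=ASSUMED_TIMING_NS):
    """Map each timing parameter name to its requirement in clocks."""
    return {name: ns_to_clocks(t, clock_hz) for name, t in timing_ns.items()}


if __name__ == "__main__":
    for mhz in (80, 144):
        print(mhz, "MHz:", timing_in_clocks(mhz * 1e6))
```

With these assumed values the same-bank cycle time rounds up to 6 clocks at 80MHz, which is in line with the "about 6 clocks" per access mentioned earlier in the thread.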
You can also keep a row open for longer than just one read or write burst and do random reads or writes inside it. You can also open other banks and overlap the commands sent to each bank. When you get clever with interleaving you can start to do some fancy things to keep the sustained output bandwidth high. Natively an 80MHz SDRAM can probably pump out data on every clock cycle if you arrange your operations carefully over different banks. Random accesses usually reduce this rate, as does the need for periodic refresh of each row. Locking the overall sequence to the hub window guarantees some deterministic behaviour and bandwidth.
You mentioned an interest in possibly getting this running on a DE0-Nano. You also mentioned you'd have to slow down to work with your existing logic analyser.
My point is you can use a P123-A7/A9 to achieve 80 MHz multi channel capture, with a 512kB or 1MB deep hub ram, and OzPropDev has already done most of the heavy lifting making this work.
Physically the DE0 and A7/A9 headers line up, so you just need to remove the conflicting power pins (from a stackthrough 40 pin header which I can give you), and can assign the connected data pins as your 'logic probe points' for seeing and testing the DE0 memory interaction, at full speed.
I know I said I wasn't going to look at this further but I think the following SDRAM command/data sequence fixes the 16 bit problem above. If this can be made to work at 80MHz with CL=2 and a burst size of 2 for both reads and writes I expect it could meet the SDRAM timing sequence requirements to simultaneously yield both:
4x32 bit reads per hub window for preloading a hub exec FIFO from a given SDRAM dedicated bank (or banks). My example below just calls this BANK 1.
AND
2x32 bit read or write opportunities per hub window from the remaining banks. Eg. one for a special large hub RAM enabled COG, and one for background refresh (and a potential DMA engine) so as not to interfere with the large hub data COG's access timing keeping it fully deterministic. I show this as BANK 2 or 3 below, but they could be the same bank, just not BANK 1.
This hub cycle loops around continuously and SDRAM data is read in or output at the appropriate times in the cycle. The appropriate banks are selected based on the address issued. If the associated READs or WRITES are not done in a cycle then they are all NOP commands instead.
CLOCK  COMMAND                   DQ BUS
0      ACTIVATE BANK 2           BANK 1 READ DATA
1      NOP                       BANK 1 READ DATA
2      READ BANK 2/WRITE BANK 2  WRITE DATA FOR BANK 2 - if writing
3      NOP                       WRITE DATA FOR BANK 2 - if writing
4      READ BANK 1               BANK 2 READ DATA - if reading
5      PRECHARGE BANK 2          BANK 2 READ DATA - if reading
6      READ BANK 1               BANK 1 READ DATA
7      ACTIVATE BANK 3           BANK 1 READ DATA
8      PRECHARGE BANK 1          BANK 1 READ DATA
9      READ BANK 3/NOP           BANK 1 READ DATA
10     ACTIVATE BANK 1
11     NOP/WRITE BANK 3          BANK 3 READ DATA/WRITE DATA FOR BANK 3
12     READ BANK 1               BANK 3 READ DATA/WRITE DATA FOR BANK 3
13     PRECHARGE BANK 3
14     READ BANK 1               BANK 1 READ DATA
15     NOP                       BANK 1 READ DATA
Update: Damn, found a problem with the timing in clock 13, you can't precharge immediately after a write, apparently you have to wait 2 clocks instead of 1. So this sequence won't work for the second 32 bit write transfer either.
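The problem noted in the update can be reproduced with a small checker. This is a sketch only: the 2 clock minimum is the figure quoted in the update rather than a datasheet value, and the event encoding is a simplification invented for illustration that covers just the write case for banks 2 and 3.

```python
# Check the write-to-precharge spacing in the 16-clock command sequence above.
# The 2-clock minimum is the figure quoted in the update, not a datasheet value,
# and the event encoding here is a simplification made up for illustration.

BURST_LENGTH = 2            # 16-bit words per read or write burst
MIN_WRITE_TO_PRECHARGE = 2  # clocks from last write data to PRECHARGE

# (clock, command, bank) for the case where banks 2 and 3 are both written
WRITE_CASE = [
    (0, "ACTIVATE", 2), (2, "WRITE", 2), (5, "PRECHARGE", 2),
    (7, "ACTIVATE", 3), (11, "WRITE", 3), (13, "PRECHARGE", 3),
]


def write_recovery_violations(schedule):
    """Return [(bank, precharge_clock, gap)] where the gap is too short."""
    last_write_data = {}
    violations = []
    for clock, command, bank in sorted(schedule):
        if command == "WRITE":
            # The last word of the burst is on the bus BURST_LENGTH - 1 clocks later.
            last_write_data[bank] = clock + BURST_LENGTH - 1
        elif command == "PRECHARGE" and bank in last_write_data:
            gap = clock - last_write_data.pop(bank)
            if gap < MIN_WRITE_TO_PRECHARGE:
                violations.append((bank, clock, gap))
    return violations


if __name__ == "__main__":
    for bank, clock, gap in write_recovery_violations(WRITE_CASE):
        print(f"bank {bank}: PRECHARGE at clock {clock} is only {gap} clock(s) after write data")
```

Under these assumptions the bank 2 write passes (last data at clock 3, precharge at clock 5) while the bank 3 write fails (last data at clock 12, precharge at clock 13).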
So I have finally been playing around some more this weekend with the P1V and getting my SDRAM driver verilog logic code tested out on a BeMicro MAX10 board. This particular board has 8MB of SDRAM fitted which will make a nice expansion of memory for XMM and/or more graphics data.
My basic SDRAM controller interface has been completed and integrated into the P1V codebase and I'm now testing it. It sequences two 32 bit reads or one 32 bit read and a 32 bit write access over the hub cycle. It also has the necessary byte lane controls for byte/word read and writes and the startup/init sequence for SDRAM upon reset. Right now one of these read cycles is used with a refresh address counter, the other for COG 0. Later I am hoping more complex sequences with multiple banks can be built up to allow much higher performance particularly when clocked faster if that is possible on this board. In fact I may have found a way that in theory may allow 4x32 bit hub exec reads, 16x16 bit video pixel reads, 1x32 bit COG read/write and one refresh cycle all in a single hub window. This would be ideal for my future plans if it's realizable....
The BeMicro MAX 10 board was attached to a basic Saleae logic analyzer and it can now see external memory read and write commands coming out on the GPIO pins I mapped the signals to. One of the issues I have is that this analyzer is limited by USB2 bus speed, so I have to clock the P1V down to 10MHz or so to check my sequence is correct before I am confident enough to speed it up and try it for real on the actual on-board SDRAM device instead of the GPIO pins used for monitoring. But so far I am happy the signals seem to be doing what I wanted on the faked SDRAM bus. I still need to get the actual COG access part working fully - right now I'm seeing strange things with what looks like 512 SDRAM reads (loading COG instructions?) at bootup. This may be a side effect of my mapping or some other bug.
For now on the P1V I've mapped the external SDRAM to addresses in the range of $80000000-$FFFFFFFF. The P1V does actually access this range during the booter sequence and trigger my SDRAM driver to do a read right away. At startup it reads the version byte from $FFF9FFFF which normally aliases to $FFFF in the hub ROM without a problem. This is probably fixable by modifying the booter code and it doesn't seem to harm it for now to have it read a different version value, but I'll need to be careful to see if anything else is hitting this address range in normal operation. I think SPIN took another addressing shortcut that I'd fixed up before when I did my 2MB SRAM testing last year.
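The decode described above can be sketched in a few lines of Python. This assumes the SDRAM is selected purely by address bit 31 and that the 8MB device simply wraps using the low 23 address bits; the function names are hypothetical.

```python
# Sketch of the address decode described above. Assumes the SDRAM is selected
# by address bit 31 and that the 8MB device simply wraps (low 23 bits used);
# the function names are hypothetical.

SDRAM_BASE = 0x8000_0000
SDRAM_SIZE = 8 * 1024 * 1024   # 8MB
HUB_MASK = 0xFFFF              # the original P1 hub decodes 16 address bits


def is_sdram(addr):
    """True if a 32-bit address falls in the $80000000-$FFFFFFFF window."""
    return (addr & 0xFFFF_FFFF) >= SDRAM_BASE


def sdram_offset(addr):
    """Byte offset inside the 8MB device, assuming simple wrap-around."""
    return addr & (SDRAM_SIZE - 1)


def hub_alias(addr):
    """Where the same address would land in an unmodified 64KB hub map."""
    return addr & HUB_MASK


if __name__ == "__main__":
    boot_read = 0xFFF9FFFF
    print("hits SDRAM window:", is_sdram(boot_read))
    print(f"hub alias:    ${hub_alias(boot_read):04X}")
    print(f"SDRAM offset: ${sdram_offset(boot_read):06X}")
```

The booter's version byte read at $FFF9FFFF has bit 31 set, so under this decode it goes to the SDRAM instead of aliasing to $FFFF in hub ROM.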
The timing is very tight if you want it to all fit in a single hub window and so I'm also needing the negative clock edges too. The plan is to have it take 8 cycles so the hub window can still be accessed in a 3 instruction loop. The current code appears to fit in with this limit (just) but I'm hoping it will still work at proper speed and not need additional clocks.
The SDRAM controller logic only takes 96 LEs so there's plenty of room for it to fit in with my existing SRAM and other video driver code too.
Here's a couple of pics of the logic analyzer results showing the SDRAM signal sequence repeating in the hub window issuing ACTIVATE/READ/PRECHARGE commands etc.
Actually Tubular has given me an idea. If needed for testing maybe I can try to use my DE-0 nano with its 66kB of RAM to act as a faster logic analyzer with some dual port hub RAM and a P1V running inside to display it nicely. Another way is to try to use the 2MB SRAM board I have made as some high speed (80MHz) signal capture storage memory within the BeMicro itself. Though doing either of these will probably derail me a little more than I need right now.
In fact I may have found a way that in theory may allow 4x32 bit hub exec reads, 16x16 bit video pixel reads, 1x32 bit COG read/write and one refresh cycle all in a single hub window. This would be ideal for my future plans if it's realizable....
Did you mean 16x16 - seems to be more clocks than available, or is that using double edge clocking ?
What is the SDRAM MHz spec vs Core SysCLK on this interface ?
@jmg. Yes I mean 16x16, but that future beefed up version of my controller, assuming BeMicro MAX10 board signal integrity and FPGA timing allows it, would be using 2x the P1V clock and this actually improves the P1V read latency. The SDRAM part fitted on the MAX10 is rated at 143MHz with CL=3. I actually hope to run it at 144 MHz (~1% overclock) to also give me a nice hub window multiple for doing software USB1.1 one day (maybe with a couple of added HW instructions) and drive a hires panel that likes a 72MHz pixel clock. If the part had been the 166MHz version or I find I can overclock that much, it would also allow a more standard 80MHz P1V instead of 72MHz.
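The clock relationships in that plan can be checked with a little arithmetic. The frequencies are from the post; the 12 Mbit/s USB full speed bit rate is the standard figure, and the names are illustrative only.

```python
# Clock-plan arithmetic for the proposed 144MHz SDRAM / 72MHz P1V option.
# Frequencies are from the post; 12 Mbit/s is the USB full-speed bit rate.
# Names are illustrative only.

SDRAM_RATED_HZ = 143_000_000
SDRAM_TARGET_HZ = 144_000_000
P1V_HZ = SDRAM_TARGET_HZ // 2      # SDRAM runs at 2x the P1V clock
USB_FULL_SPEED_BPS = 12_000_000
CLOCKS_PER_HUB_WINDOW = 16         # in P1V clocks


def overclock_percent(target_hz, rated_hz):
    """How far above the rated clock the target is, in percent."""
    return (target_hz - rated_hz) / rated_hz * 100


def pixels_per_second(pixels_per_window, p1v_hz=P1V_HZ):
    """Pixel rate sustained by a fixed number of 16-bit reads per hub window."""
    return p1v_hz // CLOCKS_PER_HUB_WINDOW * pixels_per_window


if __name__ == "__main__":
    print(f"overclock: {overclock_percent(SDRAM_TARGET_HZ, SDRAM_RATED_HZ):.2f}%")
    print("P1V clocks per USB bit:", P1V_HZ // USB_FULL_SPEED_BPS)
    print("SDRAM clocks per hub window:", 2 * CLOCKS_PER_HUB_WINDOW)
    print("16 pixels per window:", pixels_per_second(16) // 1_000_000, "Mpixels/s")
```

At 72MHz the P1V clock is a whole multiple of the USB bit rate, and 16 pixel reads per hub window sustain exactly a 72MHz pixel clock.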
Well, sing out when you're ready to be derailed : )
Bitscope is also not a bad option, goes to 40 Msps for the base model
If you wanted a P123-A7 board, now's a good time. OzPropDev has a P1V running on it, and we should have a modern P2 as soon as Chip does the image for it.
The SDRAM controller I've put together is now reading and writing bytes, words and longs at 80MHz from a P1V COG. I see it folding over as expected at 8MB boundaries, and I see byte, word and long reads all correlating with the correct data in the correct byte lanes. It is not somehow reading from hub RAM at an aliased address (I checked that), and my SRAM board is not even fitted so it can't be coming from there either, so it must be the SDRAM. The same address reads the same value when I come back to it later after other address reads, so refresh is working too. I also notice that if I leave it in reset for a bit, it can change its values, which is to be expected for SDRAM when not being refreshed.
This is awesome and I'm stoked! Still will do some further stress testing but I think I really have something I can build on from now as it's not DOA but just worked right away. Totally happy about hitting 80MHz operation first go too, those BeMicro layout guys must have done something right.
All I need is a good demo to somehow prove this new functionality...not sure what. Eventually I will want to do some more GCC changes again for this like I did for the SRAM so I can write some big C programs. :-D
Can you summarize the Board/Hardware and exactly what Clock speeds for SDRAM and P1V are used in this test & what LE count this needs ?
I guess this is 160MHz and 80MHz, but not the latest idea ?
Is the refresh invisible ? or can refresh be bumped, so any valid (data?) read gets priority, but otherwise it refreshes.
The P1V is unlikely to issue many consecutive reads
Does the clocks issued change with idle vs full RMW ? (that may save power & improved EMC ?)
The SDRAM controller I've put together is now reading and writing bytes, words and longs at 80MHz from a P1V COG.
All I need is a good demo to somehow prove this new functionality...not sure what.
@jmg,
BeMicro MAX10 board, clock is 80MHz for both P1V and SDRAM clock. As expected the LE count increased from my original number once I connected up all the real address/data pins and the extra latches were required. According to Quartus fitter the total resource usage now for my SDRAM Verilog module is 158 LEs.
For this initial test version, the current refresh cycle is totally invisible to the COG. More sophisticated versions like I discussed above could try to prioritize COGs over refresh to free up another read opportunity, but to begin with this works out for me and I like not having to worry about refresh affecting the COG during repeated long access bursts. Now I have the basic framework going I can try to extend things as I want, using other banks to gain more data accesses for my other purposes such as video frame buffer reads and potentially hub exec prefetching if I can double clock it. It will be very easy to modify my sequencer for these features; the only trick is consuming the data at the right time.
This is not EMC optimized at all right now. It's probably being a power pig and total RF generator. Perhaps I could try to only access it when really required and keep things more idle otherwise, but once I add video it would just about be accessing it all the time anyway so there may not be a lot of point to that.
How many Read / Write / Refresh slots does this basic framework give ?
Related: I see ISSI have data up on HyperFLASH and HyperRAM, and claims samples of both are available.
Devices like IS25LP064A do seem to be available, and a pair of those could get quite close to HyperFLASH emulation for testing, with some verilog patching on the address fields.
In the IS25LP064A data Figure 8.60 FRQDTR AX Read Sequence (without command decode cycles)
shows 32b (64b from pair) reads can be in ~ 10 SysCLKs
Well the initial test version is basic: it just has one COG getting a 32 bit read or write, and a hidden refresh occurring automatically per hub cycle.
As time permits I am planning to extend this to yield 1x32 bit R/W from a single COG PLUS one hidden refresh PLUS 4x32 bit reads per hub cycle, which could potentially be used either for some form of hub exec prefetching on a COG or ideally for reading 8x16 bit pixels from a frame buffer in this memory (e.g. 800x600x16bpp@60Hz). I already have this video resolution working well from my external SRAM board (see thread linked below), but it could be nice to allow the double buffering or higher resolutions that this extra SDRAM space provides and free up the SRAM for even faster access by all the COGs, giving maybe 2 SRAM accesses per COG per hub window on the MAX10 instead of just one. This should work with the P1V at 80MHz if the FPGA design reaches that rate.
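A quick calculation shows why 8x16 bit pixel reads per hub window suit that video mode. The 40MHz pixel clock is the standard VESA figure for 800x600 at 60Hz; everything else is from the post, and the names are illustrative only.

```python
# Bandwidth and storage arithmetic for an 800x600x16bpp frame buffer in SDRAM.
# 40MHz is the standard VESA pixel clock for 800x600@60Hz; other figures are
# from the post. Names are illustrative only.

WIDTH, HEIGHT, BYTES_PER_PIXEL = 800, 600, 2
PIXEL_CLOCK_HZ = 40_000_000
P1V_HZ = 80_000_000
CLOCKS_PER_HUB_WINDOW = 16
SDRAM_BYTES = 8 * 1024 * 1024


def pixels_needed_per_window():
    """16-bit pixel reads per hub window needed to keep up with the pixel clock."""
    windows_per_second = P1V_HZ // CLOCKS_PER_HUB_WINDOW
    return -(-PIXEL_CLOCK_HZ // windows_per_second)  # ceiling division


def frame_bytes():
    """Storage for one full frame."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL


if __name__ == "__main__":
    print("pixels needed per hub window:", pixels_needed_per_window())
    print("bytes per frame:", frame_bytes())
    print("whole frames that fit in 8MB:", SDRAM_BYTES // frame_bytes())
```

Eight pixels per window at 80MHz matches the 40MHz pixel clock exactly, and a frame takes a little under 1MB, so double buffering fits comfortably in the 8MB device.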
Even later if I find the SDRAM clock can be doubled up to hit 144MHz on my board (P1V at 72MHz) then I may be able to fully max it out and get 16x16 bit reads for video PLUS 1x32 bit R/W COG access PLUS 4x32 bit reads for either hub exec or a DMA engine, PLUS a hidden refresh, all in the same hub cycle.
I have designed some memory access sequences for both these extensions which from what I can tell don't break any SDRAM rules but will still have to prove once I get more time. I am hoping my approach may even work on the DE0-nano too but I need to check its SDRAM timing carefully. I know it has a larger but slightly slower SDRAM device. From what I understand right now I'd expect my first extension above should be doable too on the DE0-nano, but the second one is pushing it slightly too far unless the P1V clock is lowered a bit.
So further testing has identified a few SDRAM problems which I am still working through.
I discovered I had a design bug with the refresh address counter being incremented on a clock cycle between the ACTIVATE and PRECHARGE commands performing the refresh. This resulted in a weird error of 1 access every 16384 clocks on continuous reads of the same memory because the wrong bank was being precharged. I've moved the refresh counter increment to another clock cycle, which has now fixed that.
I still see some strange errors which happen on a bank boundary. After a random pattern has been written to memory over all banks, I reset the seed to repeat the random sequence and go back and verify each sequential memory location. At the first bank boundary I hit, the data initially reads wrong, but then repeated read accesses of the same address (retries), comparing the data with the expected value, get the correct result. This continues for a while at successive addresses until eventually nothing matches. I need to carefully check that my test itself is valid. The strange thing is that it still happens at 10MHz, down from 40MHz, so I don't expect it is a timing problem, and no hold time failures were reported. It also seemed to show itself after I changed the compile to solve the problem above.
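The write, re-seed and verify approach can be illustrated with a host-side Python sketch. This is not the actual test code, which would run on the P1V: a Python list stands in for the SDRAM, and the fault injection is made up purely to show a failure being reported.

```python
# Write-then-verify memory test using a re-seeded pseudo-random sequence.
# A plain Python list stands in for the SDRAM, and the injected fault below
# is made up purely to show a failure being reported.
import random


def write_pattern(memory, seed):
    """Fill every location with the next value of a seeded random sequence."""
    rng = random.Random(seed)
    for addr in range(len(memory)):
        memory[addr] = rng.getrandbits(32)


def verify_pattern(memory, seed):
    """Replay the same sequence and return [(addr, expected, got)] mismatches."""
    rng = random.Random(seed)
    failures = []
    for addr in range(len(memory)):
        expected = rng.getrandbits(32)
        got = memory[addr]
        if got != expected:
            failures.append((addr, expected, got))
    return failures


if __name__ == "__main__":
    mem = [0] * 1024
    write_pattern(mem, seed=1234)
    print("failures after clean write:", len(verify_pattern(mem, seed=1234)))
    mem[512] ^= 1 << 7  # flip one bit to simulate a bad location
    for addr, expected, got in verify_pattern(mem, seed=1234):
        print(f"addr {addr}: expected {expected:08X}, got {got:08X}")
```

Recording the failing address and the XOR of expected and read data makes it easier to see whether errors cluster at bank boundaries or on particular data bits.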
In another earlier test I found that bytes were being reliably read, but not words or longs so I may have an issue with skew of DQM signals.
Also at 80MHz I do currently have some problems with data reliability. Initially when I first tested this I could do individual reads/writes at this rate and get the correct result which was great, but I've recompiled a few times since then and now I can't get the data reliably read again. My reported FMAX varies each compile of course but nowadays is mostly under 80MHz (73-77).
I will need to figure out how to set up some timing constraints in Quartus to try to ensure that the timing is met, or at least that any path which will fail at a certain speed is identified. Right now the target is not defined, so the tools do not know what to aim for. Also I need to invest some more time in properly registering all my outputs and limiting combinational logic depths.
Don't worry, one way or another this thing is going to ultimately submit to my will. I've already seen it working and so I know it is doable.
I just found another similar bug to the refresh one where I wasn't carrying over the correct bank for the precharge operation on my COG's speculative SDRAM row access. This would certainly cause problems if the top bits of the address (in the COG's S register) ever changed during the period of the hub window dedicated to the COG for its access. Because it was not being latched we would likely be precharging the wrong bank and so not closing it down correctly before the next row access. It also likely explained why I saw more problems with longer or random gaps between reads, but not when I ran it faster in a more deterministic loop.
Just now I was also able once again to write and read the entire 8MB SDRAM memory with a pseudo random pattern at 80MHz, so I know it is possible to hit that speed. When I do it slower by putting delays between reads/writes it quickly fails. So both bugs so far were in the design logic, not necessarily signal/board timing. This is good, as I can easily fix my own design logic problems given some debug time to check the software results and figure it out. Speeding up actual timing, if this is required, will be much trickier.
Update: After the 2nd bug fix I've now run 1440 iterations of accessing bytes, words and longs over the full 8MB using random data writes to memory each time (seeded with the cnt register per iteration) and it seems to be good now at 40MHz. However this new .sof file was a slow one with an FMAX of less than 70MHz, and the 80MHz test still fails on some rows, though interestingly not all, and there is almost a pattern to it with some data bits being more likely to fail than others. Sometimes it gets quite a long way through memory before it fails too. I need to register some more outputs which may help this, especially on the data bus during writes, where it could shave some ns off the path through an extra mux and also help if the source register is a long way away from the IO pins.
Comments
There will be two 32 bit random read or write accesses available per hub cycle, one from a single fixed COG, the other from a basic refresh counter and ultimately a future type of DMA controller. I'd really like this controller to be able to stream data in the background from SDRAM into graphics SRAM in the spare hub slot left over for that purpose, perhaps for supporting transparent sprites and block copies etc. That will free up the main COG to just set up a transfer list in memory and the HW will go do the transfers automatically after that.
Initially this is intended for the BeMicro MAX 10 but I also have a DE0-Nano with some SDRAM on board which I (or anyone else) might try to port to if I manage to get something working. Time will tell. I hope I don't get too discouraged if I fail to get the tools doing what I want. That's probably the frustrating part given I have hardly used them and simply don't know the process to optimize timing. I know precisely what I want and pretty much where it would tap into the P1V but now have to figure out how to get these tools to make it so.
With any luck I might just get something going but as I'm hardly a FPGA guru or high speed HW guy this will be a huge challenge for me with the 80MHz timing and SDRAM clocking. Though in the end I did get my 2MB SRAM working at the full hub speed with 8 transfers per hub window so there is some hope yet.
BTW what is the width of the SDRAM? 32 bits?
Perhaps you could map some of the sdram to a single cog as extended cog ram. This would give full cog speed code and/or data space (not register space tho).
I seemed to have missed this post before
At 80MHz it takes about 6 clocks to activate an SDRAM row, read two 16 bit values, and precharge it back. In the 16 clock cycles per hub window without bank overlap that gives us two accesses, leaving 4 more transfer cycle opportunities free. Perhaps it would be nice to play around with this to try to sustain hub exec at full rate by reading larger bursts into some sort of hub exec FIFO with these extra cycles, though the random jumps between rows will cause stalls and refresh also complicates things. It might be interesting to investigate that later however, because there is probably some extra scope to try to use some overlapping dedicated bank access for hub exec, and this might get the 4x32 bits out of it in each hub cycle to yield 20 MIPS on linear code while still leaving some reasonably useful amounts of general purpose external RAM for other purposes such as graphics data. Have to think more about it...
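The arithmetic behind those numbers can be checked with a few lines of Python. This is just the budget from the paragraph above restated (80MHz clock, 16 clock hub window, about 6 clocks per non-overlapped access); the function names are mine.

```python
HUB_WINDOW_CLOCKS = 16   # clocks per hub cycle on the P1

def hub_windows_per_sec(sys_clk_hz):
    """How many hub windows occur per second."""
    return sys_clk_hz // HUB_WINDOW_CLOCKS

def accesses_per_window(access_clocks):
    """Whole non-overlapped accesses that fit in a window, plus spare clocks."""
    n = HUB_WINDOW_CLOCKS // access_clocks
    return n, HUB_WINDOW_CLOCKS - n * access_clocks

def longs_per_sec(sys_clk_hz, longs_per_window):
    """Sustained rate if a fixed number of longs is fetched every window."""
    return hub_windows_per_sec(sys_clk_hz) * longs_per_window

if __name__ == "__main__":
    print(accesses_per_window(6))            # two accesses, four clocks spare
    print(longs_per_sec(80_000_000, 4))      # 4 longs per window at 80MHz
```

Four longs per window at 5 million windows per second is where the 20 MIPS figure for linear code comes from.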
I might look at building an SRAM board x32 bits. Could access this full speed per cog instruction cycle so I could map it all to extended cog ram. This could then work as a data logger or snooper. I have been snooping USB but it's a problem to store a decent trace length with P1. Hub imposes some limitations in timing. I could add a special instruction to save and increment. This could make a nice logic monitor cheaply.
http://forums.parallax.com/discussion/161655/sram-expansion-board-with-svga-15bpp-color-graphics-and-text-on-p1v#latest
Gives higher in-line speed with no reduction of random access, but some small jitter.
This type of streaming / in-line reading is also suited to QuadSPI, and DDR QuadSPI etc.
Ideally, a CPU would have a short skip opcode ( was there a Natsemi CPU that did that ? )
On P1, the conditional flags could be used more aggressively for in-line results.
I see prices are up for Spansion HyperFLASH
1+ $3.42 at Avnet, 100MHz 128Mbit, only in 24-BALL BGA
I looked into this with the SDRAM and I am starting to think perhaps it might be possible for all the following to coexist in the overall hub cycle. If this was achievable for MAX10 P1V it would be awesome.
3 SDRAM accesses per 16 clock cycle hub loop :
- 4x32 bit read accesses on a subset of SDRAM banks for preloading a FIFO with PASM CODE instructions at the rate of 20 Million longs per second. This enables an instruction prefetcher - however handling stalls and periodic refreshes of this bank would need to be developed further.
- 1x32 bit read or write data access from the main COG for accessing extended hub DATA in the remaining banks.
- Finally an additional 1x32 bit read or 16 bit DMA write access to the same non hub exec memory banks and this can also refresh these banks either when the DMA is idle, and/or have the ability to interrupt DMA in progress with the refresh. This background refresh allows the COG's SDRAM DATA access to remain deterministic, hitting it every hub cycle and keeping refresh hidden.
It's a real jigsaw puzzle to make it fit with the various SDRAM timing restrictions but for CAS latency of 2, read bursts of 2, write bursts of 1 and using some auto-precharge this sequence below may possibly work...depending on auto pre-charge constraints. 3 accesses are made, one using the hubExec bank, and two from other banks, the second write from the DMA has to be limited to 16 bits however. This may not be a problem as I am intending to mainly use it for reads of graphics data to then be written into my graphics SRAM or streaming audio out to an i2S interface, and if it is only used for the refresh purposes instead this wouldn't matter anyway.
I'm not going to think about it further at the moment, but it is in the back of my mind to consider this later if I manage to get the SDRAM going....
OzPropDev has a neat P2 80 MHz logic analyzer with 1 pixel per 80MHz clock onto its svga (xga?) screen. You could remove a couple of power pins and carry the rest of data through
I could use 4 @ 512Kx8 (2MB) 10ns SRAMs for hub and/or extended cog ram. Of course it does require 19 address pins, 32 data pins, plus 4 sets of CE, OE, WE (or get them together), for a maximum of 63 pins.
Alternately, I could use 1 @ 512Kx8 (512KB). This only requires 19 address pins, 8 data pins, and CE, OE and WE, giving 30 pins. It is possible to share I2C for booting. It is also possible to share pins with an SD card, although for what I am talking about here, I wouldn't do that because I want full SRAM access.
So, with a P1V (assume 100MHz as it is easier to work thru timing) it would be possible to use that byte-wide SRAM and read/write one byte per system clock, making 4 bytes per instruction. This would work nicely for just one cog, but would require a hub style slot arrangement to share amongst a number of cogs.
Hub ram is 1:16, so it's 2 clocks per cog slot. Therefore you would require 2 rotations to read/write a long, but only 1 slot to read/write a byte or word.
Certainly a couple of nice possibilities here
FWIW I need to run the P1V at 96MHz, else I have to run at 48MHz. I haven't even looked to see if this is possible without changing the xtal. Anyone know the answer???
Also I am finding it best if the memory bus has its own set of pins that leave the rest of the P1V (or P2) pins free for other purposes. That's where the BeMicro MAX 10 really shines with its very low cost and high pin count. However that obviously means you use your own custom FPGA design and not use the final commercial P2 device (if/when it ships).
On the BeMicro MAX10 board for example I plan to have 32 bit Port A with some internal/external peripherals and leave 32 bit Port B totally free for general purpose expansion use, and run my separate SRAM and SDRAM buses into the hub as well. That's all likely possible. The only real limitation is the 3 COGs, but the built in video and one day hopefully audio does help alleviate that. A larger MAX10M16 device would probably have been ideal and yielded those 5 extra COGs. It was such a pity they didn't populate that part instead, even if it would have pushed the price up accordingly. If you were to spin your own board instead of using the off the shelf development MAX10 board one could always use such a chip I guess. It's a nice match for P1V based designs.
I looked at the BeMicroA9 board and this cuts down on the free pins somewhat but could still be used for LVDS displays. My little SRAM expansion board probably fits that too as it has the same number and placement of GPIO pins on that 80 way connector.
On the DE0-nano, the two groups of IDC header pins lends themselves to supporting a nice fast 32 bit SRAM interface still leaving the lower pins available for some GPIO, but there are not a lot of I/O left over. For some applications there would be enough.
I thought SDRAMS needed more preamble stuff, and then could stream ?
What's the part number of the SDRAM used ?
Yeah if realizable it would be a thing of beauty with both 20MIPs peak hub exec and a much larger hub DATA RAM space too albeit for a single (main) COG.
The BeMicro MAX 10 schematic shows IS42S16400J-7TL is the SDRAM used. This device is 8MB total using 4 x 1MBx16 bit wide banks. The DE0-nano has a larger 4 bank 32MB part if memory serves me but also 16 bits wide.
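As a rough illustration of how a byte address could split across that geometry, here is a Python sketch. It assumes 4 banks of 4096 rows by 256 columns of 16 bit words (which multiplies out to the 8MB total, but should be checked against the datasheet), and it assumes the bank comes from the top address bits; the actual bit ordering used in the controller is not stated in this thread.

```python
BANK_BITS, ROW_BITS, COL_BITS = 2, 12, 8      # assumed geometry, 16 bit words
WORD_BITS = BANK_BITS + ROW_BITS + COL_BITS   # 22 bits of word address
SDRAM_BYTES = 2 << WORD_BITS                  # 2 bytes per word -> 8MB

def split_address(byte_addr):
    """Split a byte address into (bank, row, column), bank in the top bits."""
    word = (byte_addr >> 1) & ((1 << WORD_BITS) - 1)
    col = word & ((1 << COL_BITS) - 1)
    row = (word >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = word >> (COL_BITS + ROW_BITS)
    return bank, row, col

if __name__ == "__main__":
    print(SDRAM_BYTES)                 # total size in bytes
    print(split_address(0x1FFFFE))     # last word of bank 0
    print(split_address(0x200000))     # first word of bank 1
```

With this particular layout a sequential sweep crosses a bank boundary every 2MB, and consecutive rows of a bank are 512 bytes apart.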
For SDRAM reads you basically need to :
1 activate a row of some bank
2 wait tRCD before the row is ready to be read
3 issue the read command (with optional auto-precharge)
4 wait the (tCL) CAS latency before starting to read the data off the bus, usually 2 or 3 clocks
5 read the data burst and optionally issue a precharge command during this time to prepare to close the row of the bank
6 wait some time for the precharge operation to complete (tRP) until you can re-activate some other row in this bank, ensuring you do not do more than 1 complete memory access per bank in less than tRC cycle time.
There are also some other rules about how precharge is timed, aborting bursts, and back to back activate commands etc.
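The steps above can be sketched as a small Python timing calculator that turns datasheet times in ns into clock counts and lays out one ACTIVATE/READ/PRECHARGE sequence. The default tRCD, tRP, tRAS and tRC values below are illustrative placeholders that I have not checked against the IS42S16400J-7 datasheet, and the model only covers a single bank with no overlap. It uses the common SDRAM rule that a precharge may be issued CL-1 clocks before the last wanted data word.

```python
import math

def ns_to_clocks(t_ns, clk_mhz):
    """Smallest whole number of clocks covering t_ns at the given clock."""
    period_ns = 1000.0 / clk_mhz
    return math.ceil(t_ns / period_ns - 1e-9)   # epsilon guards exact multiples

def read_schedule(clk_mhz, cas_latency=2, burst_len=2,
                  t_rcd_ns=15.0, t_rp_ns=15.0, t_ras_ns=42.0, t_rc_ns=63.0):
    """Clock numbers (ACTIVATE = clock 0) for one read access to one bank.

    All timing defaults are assumed placeholder values, not datasheet figures.
    """
    rcd = ns_to_clocks(t_rcd_ns, clk_mhz)
    rp = ns_to_clocks(t_rp_ns, clk_mhz)
    ras = ns_to_clocks(t_ras_ns, clk_mhz)
    rc = ns_to_clocks(t_rc_ns, clk_mhz)

    read = rcd                                  # step 2: wait tRCD, step 3: READ
    first_data = read + cas_latency             # step 4: CAS latency
    last_data = first_data + burst_len - 1      # step 5: the data burst
    # Precharge no earlier than CL-1 before the last data, and not before tRAS.
    precharge = max(last_data - (cas_latency - 1), ras)
    next_activate = max(precharge + rp, rc)     # step 6: tRP, bounded by tRC
    return {"activate": 0, "read": read, "first_data": first_data,
            "last_data": last_data, "precharge": precharge,
            "next_activate": next_activate}

if __name__ == "__main__":
    for name, clk in read_schedule(80).items():
        print(f"{name:14s} clock {clk}")
```

With these placeholder timings at 80MHz the same bank can be activated again 6 clocks after the first activate, which lines up with the "about 6 clocks" figure mentioned earlier in the thread.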
You can also keep a row open for longer than just one read or write burst and do random reads or writes inside it. You can also open other banks and overlap the instructions sent to each bank. When you get clever with interleaving you can start to do some fancy things to keep the sustained output bandwidth high. Natively an 80MHz SDRAM can probably pump out data on every clock cycle if you arrange your operations carefully over different banks. Random accesses usually reduce this rate, as does the need for periodic refresh of each row. Locking the overall sequence to the hub window guarantees some deterministic behaviour and bandwidth.
You mentioned an interest in possibly getting this running on a DE0-Nano. You also mentioned you'd have to slow down to work with your existing logic analyser.
My point is you can use a P123-A7/A9 to achieve 80 MHz multi channel capture, with a 512kB or 1MB deep hub ram, and OzPropDev has already done most of the heavy lifting making this work.
Physically the DE0 and A7/A9 headers line up, so you just need to remove the conflicting power pins (from a stackthrough 40 pin header which I can give you), and can assign the connected data pins as your 'logic probe points' for seeing and testing the DE0 memory interaction, at full speed.
regards
Lachlan
4x32 bit reads per hub window for preloading a hub exec FIFO from a given SDRAM dedicated bank (or banks). My example below just calls this BANK 1.
AND
2x32 bit read or write opportunities per hub window from the remaining banks. Eg. one for a special large hub RAM enabled COG, and one for background refresh (and a potential DMA engine) so as not to interfere with the large hub data COG's access timing keeping it fully deterministic. I show this as BANK 2 or 3 below, but they could be the same bank, just not BANK 1.
This hub cycle loops around continuously and SDRAM data is read in or output at the appropriate times in the cycle. The appropriate banks are selected based on the address issued. If the associated READs or WRITES are not done in a cycle then they are all NOP commands instead.
Update: Damn, found a problem with the timing in clock 13, you can't precharge immediately after a write, apparently you have to wait 2 clocks instead of 1. So this sequence won't work for the second 32 bit write transfer either.
My basic SDRAM controller interface has been completed and integrated into the P1V codebase and I'm now testing it. It sequences two 32 bit reads or one 32 bit read and a 32 bit write access over the hub cycle. It also has the necessary byte lane controls for byte/word read and writes and the startup/init sequence for SDRAM upon reset. Right now one of these read cycles is used with a refresh address counter, the other for COG 0. Later I am hoping more complex sequences with multiple banks can be built up to allow much higher performance particularly when clocked faster if that is possible on this board. In fact I may have found a way that in theory may allow 4x32 bit hub exec reads, 16x16 bit video pixel reads, 1x32 bit COG read/write and one refresh cycle all in a single hub window. This would be ideal for my future plans if it's realizable....
The BeMicro MAX 10 board was attached to a basic Saleae logic analyzer and it can now see external memory read and write commands coming out on the GPIO pins I mapped the signals to. One of the issues I have is that this analyzer device is USB2 bus speed limited so I have to clock the P1V down to 10MHz or so to check my sequence is correct before I am confident to speed it up and try it for real on the actual on-board SDRAM device instead of GPIO pins for monitoring. But so far I am happy the signals seem to be doing what I wanted on the faked SDRAM bus. I still need to get the actual COG access part working fully - right now I'm seeing strange things with what looks like 512 SDRAM reads (loading COG instructions?) at bootup. This may be a side effect of my mapping or some other bug.
For now on the P1V I've mapped the external SDRAM to addresses in the range of $80000000-$FFFFFFFF. The P1V does actually access this range during the booter sequence and trigger my SDRAM driver to do a read right away. At startup it reads the version byte from $FFF9FFFF which normally aliases to $FFFF in the hub ROM without a problem. This is probably fixable by modifying the booter code and it doesn't seem to harm it for now to have it read a different version value, but I'll need to be careful to see if anything else is hitting this address range in normal operation. I think SPIN took another addressing shortcut that I'd fixed up before when I did my 2MB SRAM testing last year.
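A rough Python model of that mapping shows why the booter's version byte read lands in SDRAM. The decode rule here is an assumption based on the description: bit 31 set selects SDRAM, with the offset wrapping at 8MB, and everything else wraps into the 64kB hub space. The constant and function names are hypothetical.

```python
SDRAM_SELECT = 0x8000_0000           # assumed: bit 31 set selects SDRAM
SDRAM_SIZE = 8 * 1024 * 1024         # offset wraps (folds over) at 8MB
HUB_MASK = 0xFFFF                    # normal P1 hub space is 64kB

def decode(addr):
    """Return (target, offset) for a 32 bit address under the assumed mapping."""
    addr &= 0xFFFF_FFFF
    if addr & SDRAM_SELECT:
        return "sdram", addr & (SDRAM_SIZE - 1)
    return "hub", addr & HUB_MASK

if __name__ == "__main__":
    target, offset = decode(0xFFF9FFFF)      # booter's version byte read
    print(target, hex(offset))
    print(hex(0xFFF9FFFF & HUB_MASK))        # where it aliases on a stock P1
```

On a stock P1 the same address simply aliases to $FFFF in hub ROM, so changing the booter to issue $FFFF directly would presumably avoid the stray SDRAM read.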
The timing is very tight if you want it to all fit in a single hub window and so I'm also needing the negative clock edges too. The plan is to have it take 8 cycles so the hub window can still be accessed in a 3 instruction loop. The current code appears to fit in with this limit (just) but I'm hoping it will still work at proper speed and not need additional clocks.
The SDRAM controller logic only takes 96 LEs so there's plenty of room for it to fit in with my existing SRAM and other video driver code too.
Here's a couple of pics of the logic analyzer results showing the SDRAM signal sequence repeating in the hub window issuing ACTIVATE/READ/PRECHARGE commands etc.
Actually Tubular has given me an idea. If needed for testing maybe I can try to use my DE-0 nano with its 66kB of RAM to act as a faster logic analyzer with some dual port hub RAM and a P1V running inside to display it nicely. Another way is to try to use the 2MB SRAM board I have made as some high speed (80MHz) signal capture storage memory within the BeMicro itself. Though doing either of these will probably derail me a little more than I need right now.
Cheers,
rogloh
Looks impressive
Did you mean 16x16 - seems to be more clocks than available, or is that using double edge clocking ?
What is the SDRAM MHz spec vs Core SysCLK on this interface ?
Bitscope is also not a bad option, goes to 40 Msps for the base model
If you wanted an P123-A7 board, now's a good time. OzPropDev has a P1V running on it, and we should have a modern P2 as soon as Chip does the image for it
The SDRAM controller I've put together is now reading and writing bytes, words and longs at 80MHz from a P1V COG. I see it folding over as expected at 8MB boundaries, and I see byte, word and long reads all correlating with the correct data in the correct byte lanes. It is not somehow reading from hub RAM at an aliased address (I checked that), and my SRAM board is not even fitted so it can't be coming from there either, so it must be the SDRAM. The same address reads the same value when I come back to it later after reading other addresses, so refresh is working too. I also notice if I leave it in reset for a bit, it can change its values, which is to be expected for SDRAM when not being refreshed.
This is awesome and I'm stoked! Still will do some further stress testing but I think I really have something I can build on from now as it's not DOA but just worked right away. Totally happy about hitting 80MHz operation first go too, those BeMicro layout guys must have done something right.
All I need is a good demo to somehow prove this new functionality...not sure what. Eventually I will want to do some more GCC changes again for this like I did for the SRAM so I can write some big C programs. :-D
Well done
Can you summarize the Board/Hardware and exactly what Clock speeds for SDRAM and P1V are used in this test & what LE count this needs ?
I guess this is 160MHz and 80MHz, but not the latest idea ?
Is the refresh invisible ? or can refresh be bumped, so any valid (data?) read gets priority, but otherwise it refreshes.
The P1V is unlikely to issue many consecutive reads
Do the clocks issued change with idle vs full RMW ? (that may save power & improve EMC ?)
You could do a rolling pattern test ?
What for a demo, hmm...
@jmg,
BeMicro MAX10 board, clock is 80MHz for both P1V and SDRAM clock. As expected the LE count increased from my original number once I connected up all the real address/data pins and the extra latches were required. According to Quartus fitter the total resource usage now for my SDRAM Verilog module is 158 LEs.
For this initial test version, the current refresh cycle is totally invisible to the COG. More sophisticated versions like I discussed above could try to prioritize COGs over refresh to free up another read opportunity, but to begin with this works out for me and I like not having to worry about refresh affecting the COG during repeated long access bursts. Now I have the basic framework going I can try to extend things as I want, using other banks to gain more data accesses for my other purposes such as video frame buffer reads and potentially hub exec prefetching if I can double clock it. It will be very easy to modify my sequencer for these features; the only trick is consuming the data at the right time.
This is not EMC optimized at all right now. It's probably being a power pig and total RF generator. Perhaps I could try to only access it when really required and keep things more idle otherwise, but once I add video it would just about be accessing it all the time anyway so there may not be a lot of point to that.
How many Read / Write / Refresh slots does this basic framework give ?
Related: I see ISSI have data up on HyperFLASH and HyperRAM, and claims samples of both are available.
Devices like IS25LP064A do seem to be available, and a pair of those could get quite close to HyperFLASH emulation for testing, with some verilog patching on the address fields.
In the IS25LP064A data
Figure 8.60 FRQDTR AX Read Sequence (without command decode cycles)
shows 32b (64b from pair) reads can be in ~ 10 SysCLKs
As time permits I am planning to extend this to yield 1x32 bit R/W from a single COG PLUS one hidden refresh PLUS 4x32 bit reads per hub cycle which could potentially be used for either some form of hub exec prefetching on a COG or ideally for reading 8x16 bit pixels from a frame buffer in this memory (e.g. 800x600x16bpp@60Hz). I already have this video resolution working well from my external SRAM board (see thread linked below) but it could be nice to allow the double buffering or higher resolutions that this extra SDRAM space provides, and free up the SRAM for even faster access by all the COGs, giving maybe 2 SRAM accesses per COG per hub window on the MAX10 instead of just one. This should work with the P1 at 80MHz if the FPGA design reaches that rate.
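A quick Python check of whether 8 pixels per hub window is enough for that video mode. The 40MHz figure is the standard VESA pixel clock for 800x600@60Hz (which includes blanking); the function name and constants are mine, and this ignores any buffering details.

```python
HUB_WINDOW_CLOCKS = 16

def pixels_per_sec(sys_clk_hz, pixels_per_window):
    """Pixel fetch rate if a fixed number of pixels is read every hub window."""
    return sys_clk_hz // HUB_WINDOW_CLOCKS * pixels_per_window

ACTIVE_PIXELS_PER_SEC = 800 * 600 * 60       # visible pixels only
VESA_PIXEL_CLOCK_HZ = 40_000_000             # 800x600@60Hz dot clock

if __name__ == "__main__":
    fetch = pixels_per_sec(80_000_000, 8)    # 8x16 bit pixels per window
    print(fetch, ACTIVE_PIXELS_PER_SEC, VESA_PIXEL_CLOCK_HZ)
    print(fetch >= ACTIVE_PIXELS_PER_SEC)    # enough for the visible area?
```

At 80MHz the fetch rate comes out equal to the 40MHz dot clock and comfortably above the 28.8 million visible pixels per second, so fetches made during blanking would leave some headroom if lines are buffered.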
Even later if I find the SDRAM clock can be doubled up to hit 144MHz on my board (P1V at 72MHz) then I may be able to fully max it out and get 16x16 bit reads for video PLUS 1x32 bit R/W COG access PLUS 4x32 bit reads for either hub exec or a DMA engine, PLUS a hidden refresh, all in the same hub cycle.
I have designed some memory access sequences for both these extensions which from what I can tell don't break any SDRAM rules but will still have to prove once I get more time. I am hoping my approach may even work on the DE0-nano too but I need to check its SDRAM timing carefully. I know it has a larger but slightly slower SDRAM device. From what I understand right now I'd expect my first extension above should be doable too on the DE0-nano, but the second one is pushing it slightly too far unless the P1V clock is lowered a bit.
http://forums.parallax.com/discussion/161655/sram-expansion-board-with-svga-15bpp-color-graphics-and-text-on-p1v#latest
Very little additional logic is needed to have a Loadable Counter, that can auto-increment on every R/W, just like serial memory does.
I discovered I had a design bug with the refresh address counter being incremented on a clock cycle between the ACTIVATE and PRECHARGE commands performing the refresh. This resulted in a weird error of 1 access every 16384 clocks on continuous reads of the same memory because the wrong bank was being precharged. I've moved the refresh counter increment to another clock cycle which has now fixed that.