Update: After the 2nd bug fix I've now run 1440 iterations of accessing bytes, words and longs over the full 8MB, using random data writes to memory each time (seeded with the cnt register per iteration), and it seems to be good now at 40MHz.
Sounds like good progress - are those Random Addr and Random Value tests?
Did you try warming the SDRAM over some tests?
However, this new .sof file was a slow one with an FMAX of less than 70MHz, and the 80MHz test still fails on some rows, though interestingly not all, and there is almost a pattern to it, with some data bits more likely to fail than others. Sometimes it gets quite a long way through memory before it fails too. I need to register some more outputs which may help this, especially on the data bus during writes, which could help shave some ns through an extra mux, and also help if the source register is a long way away from the IO pins.
That 70MHz may not include the SDRAM Tsu,Th margins either, but it does sound like their reported speeds are a reasonable guide.
I was actually thinking about this last night - I want to move to random address and random data to be 100% confident, assuming I find a pseudo-random generator that covers all addresses. I also wanted a test with some random delays between both writes and reads to make sure all hub clock initial timings are checked for the memory accesses, instead of hitting the same clock each time. I've already partially done the latter in some of my tests, but not the random address part, only random data. This hid some of the precharge bugs, because for 4095 of 4096 accesses the same bank was hit in the next access. I want to randomize it more to catch errors. Didn't try warming the SDRAM yet.
The setup and hold times of the SDRAM device on the BeMicro MAX10 board are 1.5ns and 0.8ns respectively. I clock the new data signals on the rising COG clock and feed the SDRAM the inverted (falling) COG clock, so in an ideal case there should be approx 0.5 of a cycle for both setup and hold if all output delays are the same (they won't be). Some of the other control signals are not yet registered and I'm working to remedy that to constrain it more. Unfortunately the address setup is the trickiest to register due to limited clock cycles before the data should be ready. I could always fix that by extending the RDLONG/WORD/BYTE timing for external RAM to over 8 clocks, but I'm trying to keep it within 8 to maintain the ability to write tight hub loops with a memory read/write and two instructions, like we can normally get for internal HUB RAM. The tricky part is that the address doesn't get known until the S register is read (latched in M2) and we want to start the ACTIVE command as soon as possible. I could speed it up a bit by eliminating support for doing things like "RDLONG data, phsa" etc, where the phsa register contents are used instead of normal S registers below $1F0 (i.e. in the P1V Verilog, use sy over sx), but until I get desperate for every single ns I'm still trying to keep support for doing weird things like that, which at times can be useful.
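As a sanity check on those figures, here is a rough setup/hold budget in Python. This is a sketch only: the Tsu/Th numbers are the board SDRAM values quoted above, while the 3ns clock-to-output delay is a placeholder assumption, not a measured FPGA figure, and using the same Tco for min and max is a simplification.

```python
# Rough SDRAM setup/hold budget when outputs are clocked on the rising
# COG clock and the SDRAM is fed the inverted (falling-edge) clock, so
# the sampling edge sits half a period after the launch edge.

def sdram_margins(f_mhz, tco_ns, tsu_ns=1.5, th_ns=0.8):
    period = 1000.0 / f_mhz        # clock period in ns
    half = period / 2.0            # launch edge to sampling edge
    setup_margin = half - tco_ns - tsu_ns
    hold_margin = half + tco_ns - th_ns  # data holds until next launch + Tco
    return setup_margin, hold_margin

for f in (40, 80):
    su, h = sdram_margins(f, tco_ns=3.0)  # 3ns Tco is an assumed figure
    print(f"{f}MHz: setup margin {su:.2f}ns, hold margin {h:.2f}ns")
```

With these assumed numbers the setup side is the tight one at 80MHz (around 1.75ns of slack), which matches the observation that unregistered outputs start failing first as the clock goes up.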
Random, that is, statistical methods only give statistical results. In the early days we had observed that a computer program sometimes crashed. It took some time to be able to provoke the crash, and so step by step it could be isolated: when data involving a lot of bit changes coincided with the same situation on the data bus, spikes were generated on the backplane and data was corrupted. Easy to imagine that random bit patterns in realistic time would not show this result. So my advice: take care that timing is in line with the data sheets, and keep checking the eye diagrams.
Well, yes, but here random is not applied in the true statistical sense.
Here it means more 'not simply sequential' and would likely include test patterns to cover exactly those all-bit change corner cases, as well as varying W-R delays etc.
So I just found a nice tweak I put into the COG Verilog that gave me 0.5 clock cycles more address setup time, by latching "sy" on the negative clock edge of the M1 phase instead of the positive edge half a cycle later. This tweaked version is running my SDRAM test at 80MHz now and has done a few hundred complete iterations of my random 8MB test (random data, not random address yet) with bytes/words and longs, passing them all without detecting any data errors. That didn't happen before, and this is despite being listed as a slow FMAX compile (68/73MHz) and still not registering the address. I may also be able to register it now that it arrives earlier than before.
So things are looking up for a full speed COG with SDRAM.
LOL, I will be in need of beer after all this. There are still some problems... :-(
Packing this into one HUB cycle is obviously a nice target, but I think many would be ok with a longer SDRAM access, which could be based on address (genuine SRAM keeps the normal any-COG-in-any-HUB-cycle operation, while SDRAM has one COG in 1 or maybe 2 HUB cycles).
Maybe SDRAM burst can do 16 on the first, then 8 on following data, which makes the speed impact smaller still?
Yeah, taking more clocks would likely simplify things, but I'm still aiming to reach 8 clock cycles per access at 80MHz. I do expect it should be achievable. This is technically one more clock than the internal hub RAM best case, but it won't be noticed in a tight loop of 16 clocks. The reason I want this is that I want to support an LMM/XMM type loop executing from external SDRAM at 5 MIPs (4.5 without my prior auto-increment RDLONG mod). Then later, if some form of hub exec type prefetching can be worked through, this may possibly be boosted up to 20 MIPs peak in the absence of jumps, but that is a stretch goal further out, either after or instead of my video frame buffer integration plan. I also want to try to externally clock the SDRAM at 2x at some point to see if that is possible on the board with the I/O pin timing. The COG can be run at 72MHz, so that actually may help slightly.
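The 5 MIPs figure follows directly from the numbers above - an 8-clock external access plus two ordinary 4-clock instructions exactly fills the 16-clock hub window. A quick check:

```python
# Why 8 clocks matters: an 8-clock external RDLONG plus two ordinary
# 4-clock instructions exactly fills the 16-clock hub window, so a tight
# LMM fetch-execute loop retires one hub access per window.
F_CLK = 80_000_000   # COG clock in Hz
HUB_WINDOW = 16      # clocks between hub sweet spots for a P1 COG
ACCESS = 8           # target clocks for the external SDRAM access
INSTR = 4            # clocks per ordinary COG instruction
assert ACCESS + 2 * INSTR == HUB_WINDOW   # two instructions still fit
mips = F_CLK / HUB_WINDOW / 1_000_000
print(mips)          # 5.0 - the quoted 5 MIPs LMM rate
```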
Earlier I found that introducing random delays during writes at 80MHz was still causing problems on a bank boundary (for bytes/words - not longs, interestingly), so I may have some clocking issue, or some data is changing when it should not because I'm not latching it, or I am not latching it at the right time. Back to back accesses were working fine.
Part of the problem here is that I am not following normal HW design methodology with proper Verilog simulations etc as I'm not a trained HW design engineer (I'm primarily SW with some extra HW capabilities). Also as a result of that I don't know enough to control Quartus to fully specify all the timing constraints. After all these recent changes I probably need to sit down, draw out all the timing diagrams again and double check each input condition to see if I have missed some corner case somewhere. It seems really close, but there is at least one more bug to resolve, maybe more.
Found an issue with some spurious SDRAM memory accesses during loading of the COG before RUN is enabled, and fixed that. I also moved my bank address latching one cycle earlier and used that for both ACTIVE and PRECHARGE cycles, as there was still some possibility of the address changing in the meantime. With this combination, random data with randomized write and/or read timing now works at 80MHz for all transfer sizes.
So for now at least, I have no known issues with SDRAM, but still need further testing with randomized addresses instead of sequential to stress it further. Plus I want to sanity check the entire timing diagram another time and see if I can register more signals where possible.
Got random address generation working and tested with my external SDRAM on the P1V at 80MHz - it seems to be working well. Tonight I put together a test harness in SPIN to control a COG running PASM that iterates over SDRAM addresses in the desired range using a linear congruential generator (apparently that is the name for it). It is designed to never repeat the same address within a pass over the memory range, which would overwrite a previous value and ruin the test. I also added the MUL instruction to PASM in HW to speed up the random number generation significantly.
The address sequence fully exercises the four memory banks in the SDRAM, jumping around in the different banks and rows pretty much randomly. The SDRAM COG then writes either 32/16/8 bit pseudo random data (separate generator for data patterns) to the memory addresses generated. Then afterwards, by using the same seeds, I can read data back at these same random memory addresses throughout the SDRAM and it matches the expected random data each time it is run. I also introduced random timing between accesses using the least significant bits of the random data to do some short WAITCNTs. This should ensure SDRAM accesses are initiated on each of the 16 different hub clocks in the window.
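For anyone wanting to reproduce this style of test, a full-period LCG over a power-of-two address space can be sketched as below. The multiplier/increment here are illustrative values satisfying the Hull-Dobell full-period conditions, not necessarily the ones used in the actual PASM test.

```python
# A full-period linear congruential generator over a 2**k address space:
# with m = 2**k, the period is maximal iff c is odd and (a - 1) is
# divisible by 4 (Hull-Dobell), so every address is visited exactly once
# per cycle - no address gets overwritten before it is read back.

def lcg_addresses(k, a=1664525, c=1013904223, seed=0):
    """Yield all 2**k addresses exactly once, in pseudo-random order."""
    assert c % 2 == 1 and (a - 1) % 4 == 0   # full-period conditions
    m = 1 << k
    x = seed % m
    for _ in range(m):
        yield x
        x = (a * x + c) % m

# Demonstrate full coverage on a tiny space (2**8 addresses); the same
# generator with k = 21 would cover 8MB worth of long-aligned addresses.
addrs = list(lcg_addresses(8))
assert sorted(addrs) == list(range(256))   # each address hit exactly once
```

Replaying the sequence with the same seed for the read-back pass is what makes the write/verify comparison work without storing the whole sequence.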
So all in all I am quite pleased with this now and it seems quite solid with no failures yet. Will have to see if I run into any further timing issues down the track but for now things look quite good and I can continue on.
Now wondering if I might take a look at my DE0-nano and see if I can port this basic SDRAM memory controller there too. Shouldn't be as hard now the design is operational. Most of that will be tweaking the address range for the different column and row arrangement of the 32MB SDRAM device and setting up the correct pinout for the board. The core SDRAM controller/COG parts should remain the same.
Sounding great.
Can you do a quick summary of this working version's Refresh + User IO performed, in how many P1V SysCLKs ?
Sounds like a good idea, as that also gives another set of test data.
This is the most basic version I've been talking about above. It refreshes one SDRAM row per hub cycle, which is more than sufficient. This SDRAM has banks of 4k rows which all need to be refreshed in 64ms. Because I'm using standard READ cycles instead of the dedicated auto-refresh command, which refreshes all banks in parallel, I actually need to do each bank separately. So that effectively adds up to refreshing 16k rows in 64ms instead of 4k, which is one row about every 4us, or a ~250kHz refresh cycle rate required. With one refresh per hub cycle, this should allow P1V operation down to roughly 4MHz clock rates instead of 80MHz, but not the low frequency RC oscillator, which is too slow for refresh.
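That refresh arithmetic can be sketched as follows, with the hub cycle taken as 16 clocks as on a standard 8-COG P1:

```python
# Refresh budget: 4 banks x 4096 rows, each refreshed via a plain READ
# (so banks cannot be done in parallel), all within the 64ms limit.
BANKS, ROWS, T_REFRESH = 4, 4096, 0.064
interval = T_REFRESH / (BANKS * ROWS)   # seconds available per row refresh
rate = 1.0 / interval                   # rows per second needed (~256kHz)
HUB_CYCLE = 16                          # clocks per hub cycle on the P1
f_min = HUB_CYCLE / interval            # slowest clock for 1 refresh/hub cycle
print(interval * 1e6, rate / 1e3, f_min / 1e6)  # ~3.9us, ~256kHz, ~4.1MHz
```

So one refresh per hub cycle keeps the SDRAM alive down to about a 4.1MHz system clock, consistent with the ~4MHz floor quoted above.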
For the single clocked SDRAM (not 2x COG CLK yet), I found that 8 P1V cycles are just sufficient for a single COG to get either a read or write access per hub window as required. This is great as it still allows 2 additional instructions after the RDLONG to be executed before the next hub cycle sweet spot arrives, like we have today with regular HUB SRAM.
This driver is using the SDRAM in the simplest way and there is sufficient bandwidth in the SDRAM to do several extra transfers in parallel with this, that's hopefully coming next as I get more time to continue this.
So I just ported this memory controller code to the DE0-Nano too, and its 32MB SDRAM is now functioning at 40MHz with random data and random address testing over its entire range. 80MHz operation is a little flaky however on the odd 16 bit word addresses - still testing...
Quartus TimeQuest reports FMAX of 75 and 84MHz for hot/cold slow model which seems pretty fast. I may have to register all these control signals in the end which is the right thing to do for more consistency.
Comparing the SDRAM specs, and FPGA speed-bin, and general PCB layouts, would you expect the DE0-Nano to come in a little worse ?
Actually, originally I thought the SDRAM chip was slower on the DE0-Nano, but I had an outdated data sheet for the rev B device. The part on my own DE0-Nano is the rev G device (IS45S16160G), which on paper, if anything, looks slightly faster than the part on the BeMicro MAX10 board (IS42S16400J). Both boards have ISSI SDRAM parts.
From a layout perspective it is hard to tell - both SDRAM parts are about as close to the BGA FPGAs and I can't really see the tracks on the board.
The FPGA speed of the Cyclone IV is C6 on my DE0-Nano and the C8ES on the MAX10 board (which I think is slower grade).
I'm seeing something consistently weird at a given 16-byte row of the RAM. Offset 0x4050 reads back changing data (so maybe SPIN is writing something there, causing my test to fail). This is happening at 40MHz too. I need to check the test - maybe this memory is really working ok, or maybe I have an addressing bug somewhere.
Update: DOH! I had taken a fresh P1V to port my SDRAM controller into, and it did not have my earlier mod to prevent SPIN calls from accessing hub memory over 64kB. I need to fix that now. What is weird is that the BeMicro ROM version I have is totally different - perhaps it was encrypted in the original? Yes, I recall taking some decoded hex file once so I could patch it.
Spent time tonight registering all my SDRAM controller outputs. It seems to have paid off, as I am now running the DE0-Nano at 80MHz with the SDRAM and it appears to be working with no problems so far, using random addressing and random data with random timing. The FMAX also seems to have gone up a bit, perhaps as a result of this change. I have FMAX of 78/87MHz for the slow 85C/0C models respectively.
This version has my "superCOG" (SDRAM HUB enabled + MUL/MULS instructions as COG0) plus 7 regular COGs, the high 16kB ROM, as well as the SDRAM controller itself. It takes 69% of the Cyclone IV EP4CE22F17C6 device's LE resources.
I could possibly consider releasing the binary file if anyone with a DE0-Nano wants to risk having a play with a P1V with 32MB of SDRAM accessible at address $80000000 upwards. The I/O pinning is the standard arrangement. I cannot guarantee it doesn't have problems, won't crash, or might not be slowly destroying the SDRAM chip or FPGA with potentially bad bus timing, but it is apparently working on my board at 80MHz with my limited testing done so far. YMMV.
This looks more and more interesting. It sounds like something I could use for an emulator project I had in mind. I don't really need the P2, just a P1 enhanced just this way. And my DE0-Nano sits idle.
But I won't have time free for a few months yet
Can you do a simple spread sheet or table of each Clock Cycle this uses, and what the pin-states & BUS direction are ?
For P2, the bit wiggling would usually be SW, which would be s-l-o-w, but I was wondering about using multiple streamers here:
* One streamer, 2 or 4 bits wide manages the control lines, with some small field length
* Another wider Data streamer is carefully launched when the data windows appear.
This assumes a level of fine granularity is possible in P2 around streamers, and that you can pause to 1 SysCLK precision.
One COG would be managing SDRAM pins flat-out.
Nice work. I am curious if you have estimated what device would be required for you to get 16 cores? When you compile the FPGA image, is there some info that states how many logic elements are required to host it? The Cyclone you mention is 22320 LE, and you are at 8 cores with 69%. So I assume you could get a few more on there, but nowhere near 16. 69% = 15400.8 LEs, so does this directly extrapolate out to 16 cores = 30800 LE?
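One caveat with extrapolating linearly: the 69% includes shared logic (hub, ROM, the SDRAM controller) as well as the 8 COGs, so straight doubling should overestimate a 16-COG build. A rough split, where the 20% shared fraction is purely an assumed figure for illustration, not a reported number:

```python
# Splitting the reported 69% utilisation into shared logic and per-COG
# logic before extrapolating to 16 COGs. SHARED_FRAC is an assumption.
TOTAL_LE, USED_FRAC, COGS = 22320, 0.69, 8
used = TOTAL_LE * USED_FRAC                  # ~15400.8 LEs, as in the question
naive_16 = 2 * used                          # straight doubling: ~30802 LEs
SHARED_FRAC = 0.20                           # assumed share of hub/ROM/SDRAM
per_cog = used * (1 - SHARED_FRAC) / COGS
est_16 = used * SHARED_FRAC + 16 * per_cog   # ~27721 LEs under this assumption
print(round(used), round(naive_16), round(est_16))
```

Either way the answer to the sizing question is the same: well beyond an EP4CE22, so a 16-core build would need the next device class up.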
Comments
To collect the free beer you will have to come to Sydney though.
Looks like you're ready to collect your BEER prizes.
IIRC the MAX10 is on a quite new process?
Hmm, 69% fit. The next size down, the EP4CE15, has 15408 LEs - it'd be a close fit (probably too tight).