Here is my latest version:
- Cache size can be set from 1KB up to 16KB of HUB memory
- Cache line size can be from 16 bytes up to 512 bytes.
- The maximum number of cache lines is 256
CONstant calculations in the driver will determine the "optimum" schedule for refreshes, given the number of bytes in a cache line. NOTE: For this to be correct, you need to set the MY_CLKFREQ constant in the driver to the CLKFREQ you will be using for the overall project.
In order to free up enough longs to have 256 cache tags in COG memory, I squeezed and squeezed. Now, there is not enough room for Steve's (optional) debugging code (which looks to need 9 longs) unless you lower the MAX_TAGCOUNT to 128. I might be able to find one more long to free up...
However, there is now lots of available memory for further initialization code to compute the CLKFREQ-dependent refresh values at runtime (it looks like there are $C0 == 192 longs available with a 256-entry tag vector), so that will probably be my last version of this particular driver (except for bug fixes, of course).
And I suppose I will need to give estimated speeds... and write a speed test for comparison purposes... Sigh.
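The limits in the list above interact: a large cache with short lines runs into the 256-tag ceiling. Here is a rough sketch of that check in Python (not the driver's Spin/PASM). The function name and the power-of-two requirement on the line size are my assumptions; only the numeric limits come from the post.

```python
# Limits quoted in the post. Everything else (names, the power-of-two
# check) is an assumption made for this illustration.
MIN_CACHE_BYTES = 1 * 1024
MAX_CACHE_BYTES = 16 * 1024
MIN_LINE_BYTES = 16
MAX_LINE_BYTES = 512
MAX_TAGCOUNT = 256  # lower to 128 to leave room for the optional debug code


def cache_line_count(cache_bytes, line_bytes, max_tags=MAX_TAGCOUNT):
    """Return the number of cache lines, or raise ValueError if the
    combination falls outside the limits listed above."""
    if not MIN_CACHE_BYTES <= cache_bytes <= MAX_CACHE_BYTES:
        raise ValueError("cache size must be 1KB..16KB")
    if not MIN_LINE_BYTES <= line_bytes <= MAX_LINE_BYTES:
        raise ValueError("line size must be 16..512 bytes")
    if line_bytes & (line_bytes - 1):
        raise ValueError("line size assumed to be a power of two")
    if cache_bytes % line_bytes:
        raise ValueError("cache size must be a whole number of lines")
    lines = cache_bytes // line_bytes
    if lines > max_tags:
        raise ValueError("too many cache lines for the tag vector")
    return lines


if __name__ == "__main__":
    print(cache_line_count(16 * 1024, 64))   # 16KB of 64-byte lines
    print(cache_line_count(4 * 1024, 16))    # 4KB of 16-byte lines
```

So a full 16KB cache needs lines of at least 64 bytes, and with MAX_TAGCOUNT lowered to 128 the same 16KB needs lines of at least 128 bytes.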
Nice work
So you're using CTRB to gate CTRA ? That's pretty cool!
I used gating a lot in the original driver, but it never struck me to use the other counter output as a gate. :yummy:
Regarding the clock frequency, I often use a "properties.spin" file for global program constants which makes this easy.
Since you have the SDRAM throughput around 7MB/s, it seems that a great video driver that pulls data from SDRAM is more likely to succeed now. Once I have all my hardware in FAB, I'll start looking at it again.
jazzed:... seeing that Ding used 2 latches for the addresses I was wondering if a 16 bit interface could be useful... I mean you may get 2x the performance from burst-read and write... just so-so for byte/unaligned... do you have any thoughts on this ?
Hi Ale. I have a board with a 16 bit interface and use rdword/wrword to read/write 16 bits per HUB cycle. Using CTRB to gate CTRA might allow higher throughput. I might have a chance to play with that later.
So you're using CTRB to gate CTRA ? That's pretty cool!
I used gating a lot in the original driver, but it never struck me to use the other counter output as a gate. :yummy:
I think I got the idea from another thread that was discussing generating a fixed number of pulses using the counters -- maybe I'll go hunting for it. My revelation was from your code -- I was under the assumption that SDRAM wanted a constant clock, and your code played lots of games with the clock. I hadn't thought through that the clock was simply used to trigger internal state changes in the SDRAM, and that I could control when those transitions happened for my convenience.
But I think my real insight was seeing an 8-byte read loop with two wrlongs, and that the loop control and ramclk control could fit into the two left-over hub access delay slots for those two wrlongs, if I could control the ramclk burst length in one instruction.
Regarding the clock frequency, I often use a "properties.spin" file for global program constants which makes this easy.
Yup, that approach would work, but now that there's a lot of room for initialization code (as part of the 256 longs of the tag vector), I think I might just make it runtime-calculated.
Since you have the SDRAM throughput around 7MB/s, it seems that a great video driver that pulls data from SDRAM is more likely to succeed now. Once I have all my hardware in FAB, I'll start looking at it again.
Well, I haven't measured it yet, and I have to revise my spreadsheet with the current numbers, so 7MB/s might be a little optimistic -- as I recall, that required a 100MHz system clock, and IIRC 512 byte cache lines. Also, my estimates were for raw block I/O, not cached. Because of the way the cache code works, write speeds will be one-half the read speeds, because each cache line is read, even if it will be immediately overwritten.
I do want to put together an accurate sustained speed measurement, and also a "raw" driver that simply does the I/O to caller-supplied buffer(s) without caching, for maximum speed. But that shouldn't be too hard now.
A board using a 16-bit data width would require 21 Propeller pins (jazzed's current GG board uses 20), and two 8-bit latches.
But I don't think a one-cog driver could make it all the way to 2x the performance of the 8-bit driver, because there are no helpful instructions for gluing together the two halves of a long quickly so that you can use wrlong instead of wrword.
Double the burst rate performance means 6 system clocks per byte, or 12 system clocks per word read. But using wrword means that the fastest transfer speed would be 16 clocks per word. If you used wrlong, you have 24 system clocks to read two 16-bit words, assemble them into a long, and perform the wrlong, or in other words, you have to read D15..D0 twice, and assemble into a long, in 4 instructions. I don't see the four instructions that can do that, using uninitialized storage. I see five instructions, but that would make for interesting interleaving in the I/O routines.
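The clock budget in that paragraph can be checked with a few lines of Python. This is only the arithmetic, not driver code. The 12 clocks per byte for the 8-bit driver is inferred from "double ... means 6 system clocks per byte", and the 4-clock instruction time, 16-clock hub window and 8-clock minimum for a hub write are the usual Propeller 1 figures as I understand them.

```python
CLOCKS_PER_INSTRUCTION = 4   # typical Propeller 1 instruction time
HUB_WINDOW_CLOCKS = 16       # one hub access slot per cog every 16 clocks
HUB_WRITE_MIN_CLOCKS = 8     # best case for wrword/wrlong when in sync
EIGHT_BIT_CLOCKS_PER_BYTE = 12  # inferred from the post, not measured


def bytes_per_second(clkfreq, clocks, nbytes):
    """Throughput when nbytes are moved every 'clocks' system clocks."""
    return clkfreq * nbytes // clocks


# wrword path: one 16-bit word per hub window.
wrword_rate = bytes_per_second(80_000_000, HUB_WINDOW_CLOCKS, 2)
wrword_speedup = EIGHT_BIT_CLOCKS_PER_BYTE * 2 / HUB_WINDOW_CLOCKS

# wrlong path: two words (4 bytes) in 24 clocks would be the full 2x.
wrlong_budget = 24
total_slots = wrlong_budget // CLOCKS_PER_INSTRUCTION
free_slots = total_slots - HUB_WRITE_MIN_CLOCKS // CLOCKS_PER_INSTRUCTION

if __name__ == "__main__":
    print(wrword_rate)      # bytes per second at 80MHz
    print(wrword_speedup)   # relative to the 8-bit driver
    print(free_slots)       # instructions left to read and merge two words
```

With these assumptions the wrword route tops out at 1.5x the 8-bit burst rate, and the wrlong route leaves exactly the four instruction slots mentioned above.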
I agree and I do use 21 pins on the MicroPropPC board. The simple case using rdword/wrword with tricks for updating the HUB pointer would load/store at most 2 bytes every 16 clock ticks or 10MB/s at 80MHz with one COG. Ding's byte wide driver almost gives that with fewer pins. That 8 byte loop is tight!
I just thought I'd mention: the drivers I have been posting are all for Steve's GG SDRAM board -- I have not been working with a two-latch board, until I can straighten out a few hardware issues.
So I have been testing with a GG Propeller Platform board and the GG SDRAM board, and also with a Propeller Prototype board with GGPP spaced headers mounted on it (a "ProtoPP" board) and the SDRAM board. This means everybody can play...
Also, the last time I modified the caching driver for a two-latch, 14-IO-pin version, the send_ADDRESS routine grew by only four instructions (16 clocks) and the refresh routine grew by only one instruction, giving a "slowdown" of 20 additional system clocks per block read. For a 32-byte block, which needs about 640 clocks (including overhead), that is only about a 3% slowdown. For a 256-byte block, the slowdown is less than 1%.
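For anyone who wants to redo those percentages, here is the arithmetic as a small Python sketch. The 20 and 640 clock figures are from the post. The 256-byte case uses only the burst time at an assumed 12 clocks per byte (inferred earlier in the thread), so it gives an upper bound on the slowdown rather than the real number.

```python
def slowdown_percent(extra_clocks, block_clocks):
    """Extra clocks as a percentage of the original block time."""
    return 100.0 * extra_clocks / block_clocks


def block_bytes_per_second(clkfreq, block_bytes, block_clocks):
    """Sustained rate if every block takes block_clocks system clocks."""
    return clkfreq * block_bytes // block_clocks


EXTRA_CLOCKS = 20                 # 16 (send_ADDRESS) + 4 (refresh)
small = slowdown_percent(EXTRA_CLOCKS, 640)

# Assumption: a 256-byte block takes at least 256 * 12 clocks of burst
# time, before any overhead, so the true slowdown is below this bound.
large_bound = slowdown_percent(EXTRA_CLOCKS, 256 * 12)

if __name__ == "__main__":
    print(small)
    print(large_bound)
    print(block_bytes_per_second(80_000_000, 32, 640))
```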
The initialization code now calculates the proper refresh parameters for the current clock speed, so there's no need to edit the driver source to modify the clock speed.
The refresh burst length calculation is a little better, so refreshes are done just a little less often (perhaps 10% less often).
I don't see much more to modify in this driver now (except for bug fixes, which I am sure will be needed). So next up is a speed test for the SdramTest.spin program, and a raw block driver built strictly for speed.
And maybe a version for the 14-pin, two latch version of the hardware...
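To show the kind of calculation the initialization code has to do, here is a guess at it in Python. This is not the driver's actual algorithm. It assumes the common SDRAM requirement of 8192 refresh cycles every 64 ms (check your chip's datasheet), and it assumes a burst of refreshes is issued after each block transfer. All names are made up.

```python
REFRESH_CYCLES = 8192      # assumption: typical for 256Mbit SDRAM parts
REFRESH_PERIOD_MS = 64     # assumption: typical refresh period


def clocks_per_refresh(clkfreq):
    """System clocks available, on average, between single refreshes."""
    return clkfreq * REFRESH_PERIOD_MS // (1000 * REFRESH_CYCLES)


def refresh_burst_length(clkfreq, block_clocks):
    """Refreshes to issue after a block that took block_clocks clocks,
    rounded up so the chip is never under-refreshed."""
    budget = clocks_per_refresh(clkfreq)
    return -(-block_clocks // budget)  # ceiling division


if __name__ == "__main__":
    for freq in (80_000_000, 100_000_000):
        print(freq, clocks_per_refresh(freq), refresh_burst_length(freq, 640))
```

Rounding up on every block over-refreshes a little. Carrying the remainder from one block to the next would shave that off, which may be the sort of improvement described above, though that is speculation.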
I have a GG Propeller Platform board ready to go that meets your specs. Let me know if I should make it.
I've been very busy lately...
Please, please "make it." I saw that you wrote earlier that you were working on a design -- that would be great! It would save me a longish learning curve in board layout, and probably some early duds...
Many, many questions:
How much?
How many?
From you or from GG?
Bare boards or populated? with or without headers mounted?
Now that I've asked, of course I have preferences. I could probably use at least two populated, without headers mounted, and several more bare boards (I have other SDRAM chips to try...).
I also have a modified test program with a read speed test -- I'll post the results in a few minutes.
Attached is my latest version of the SDRAM Cache driver and test program, and also read speed test results for the six supported cache line sizes and four different clock speeds.
The driver in this version is the same as the previous version I posted -- only the SdramTest.spin program is different. It contains a Speed Test, that reads all of the SDRAM memory as fast as possible (without data verification). It uses another PASM cog to do the reads, with the next read submitted in the next hub access window after determining the previous read is done. That cog also does the clock-level timing, while the Spin code simply reports the results.
By comparison, the original driver Steve wrote ran about 10% faster for 32-byte cache lines at 80MHz, but I believe it did this by significantly under-refreshing the SDRAM under heavy load.
Also, due to the way writes are handled in the cache driver, they will run about half the speed of reads, because every write is followed by a read even if the data isn't really wanted.
One last comment -- I believe it is not too difficult to write a two-cog reader/writer that would come close to doubling the read speeds above, at the cost of using an additional cog. That would give 14.1 MiB/s for 512-byte blocks at 100MHz.
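Turning the timing cog's clock counts into MiB/s is simple, but the 32-bit system counter can wrap during a run. A minimal Python sketch of the conversion follows. The function names are made up, and this is not the code in SdramTest.spin.

```python
CNT_MASK = 0xFFFFFFFF  # the Propeller 1 system counter (CNT) is 32 bits


def elapsed_clocks(start_cnt, end_cnt):
    """Clocks between two CNT samples, correct across a single wrap."""
    return (end_cnt - start_cnt) & CNT_MASK


def mib_per_second(nbytes, clocks, clkfreq):
    """Convert bytes moved in 'clocks' system clocks to MiB/s."""
    return nbytes * clkfreq / clocks / (1 << 20)


if __name__ == "__main__":
    # Made-up samples that straddle a counter wrap.
    print(elapsed_clocks(0xFFFFFF00, 0x00000100))
    # 1 MiB moved in exactly one second's worth of clocks.
    print(mib_per_second(1 << 20, 100_000_000, 100_000_000))
```

A single wrap is handled, but a run longer than 2^32 clocks (about 43 seconds at 100MHz) would need an extra wrap count.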
Jazzed:
I have been continuing to attempt to meet the TV driver challenge. It would be much easier if the data out from the SDRAM could be sent directly to the composite DAC. I hope to have something that works correctly before long, though no promises.
Bare boards from me on about a 10 day turnaround would be $150 for 2 or $200 for 10.
Price schedule for board types (no headers).
Micron 64MB $80 + board cost
Micron 32MB $50 + board cost
ISSI 32MB $40 + board cost
ISSI 16MB $30 + board cost
Qimonda 16MB $25 + board cost
No Assembly: just board cost
Prices do not include shipping cost. Add a few days to turnaround time for assembly.
Let me know if you want 2 boards or 10 boards and the assembly types. After that I'll start a build.
At some point, it really makes sense to produce a board with a single Propeller, VGA, Keybd/Mouse, FTDI USB, regulators, and other support circuits that will fit in a cheap enclosure. It would make a nice Propeller-PC.
@davidsaunders, The challenge is to make generic hardware more usable for those who already own it.
From a clearout of old "junk" I have a couple of SDRAM DIMMS that were populated with MT48LC64M8A2 chips, so apart from being twice as big as your 32M8A2 quoted in the circuits (2K row addr, instead of 1K) I guess that they would work. The pin outs are the same.
Good find! Your junk drawer has value! The parts should work perfectly with a driver tweak.
Digikey wants $31 each for those chips which is more than double the 32MB price. That's the main reason an assembled 64MB board would be $90 (carrying cost and assembly are the other reasons).
I have two DIMMS from an old AMD Thunderbird MB, so that gives me 16 of them.
If I could have some faith that the hot air removal leaves them intact then I would willingly give most of them to other people ( or take your chances).
I wince at the thought of those poor defenceless chips being welded down in an oven, then some troglodyte, like me, comes along and un-welds them only to try and re-weld them down again!
At least with the through hole sort they do not suffer this and can be tested easily.
Just for a giggle, place your bets on the chances of this ...
With a following wind, it might do VGA (white on blue), the usual SD, and KBD (switched on the EEPROM pins) with 64 MB of SDRAM. It was inspired, mostly, by settled apple juice and could result in the worst waste of RAM known to man (after Windoze) in getting a Nascom, or its ilk.
Tomorrow the build !! (insert maniacal laughter), and again, and ...
Underneath the DIPs there is enough space for the soldering of the wires, but with 0.1mm the TSOP SDRAMs just don't allow for it. I made it single sided and put the SDRAM on the bottom because of this.
If this board does work then I will have to do a better one that has the sockets and regs on board. I tried to make a version of this a while ago, but having the memory bit of it as a separate board (for different RAM sorts) had the disadvantage of the 2" of 40-way IDE cable and plugs. This did affect things, so this is an experiment with 48LCxxxx only.
I have run up the board and it gets through the walking Zeros and the walking Ones tests and then starts on the Incremental Pattern test showing 33554 KB (presumably 32M ?)
A row of "w"s appears slowly and then an "r" is spat out, followed by -
No idea what goes on there, but you have "Expected $0001FF02 Received $0100FF02".
I notice the difference is that the top two bytes are in reverse order. Looks like when you get out of 64K things go astray with the top word.
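That observation is easy to confirm with a few lines of Python, using the two values from the error report. The helper name is made up. This only confirms the symptom and says nothing about the cause.

```python
def swap_top_bytes(value):
    """Exchange bits 31..24 with bits 23..16 of a 32-bit value."""
    byte2 = (value >> 16) & 0xFF
    byte3 = (value >> 24) & 0xFF
    return (value & 0x0000FFFF) | (byte3 << 16) | (byte2 << 24)


EXPECTED = 0x0001FF02
RECEIVED = 0x0100FF02

if __name__ == "__main__":
    # True if the only damage is the two upper bytes changing places.
    print(swap_top_bytes(EXPECTED) == RECEIVED)
    # The XOR shows which bits differ: only the low bit of each upper byte.
    print(hex(EXPECTED ^ RECEIVED))
```

The low word is untouched, which fits the remark that things only go astray once the test gets past 64K.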
It is supposed to be just as your sdram8232_3, but I only have 74HC573s and I did put the CKE straight up to VDD and CS & DQM down to VSS (because I forgot to put in the resistors).
The Prop is doing nothing else at present. I didn't put a reset button on this one either, hence I originally missed the 64KB "sanity" test results; by the time I had switched over to PST it had gone by.
Time to get the 'scope out, methinks. I hope I can find out what is wrong, as this is all a part of learning, after all!
I am unsure about that too. I started to probe about, and to get the signal from the CLK I parted the wires that I had to put on because of the resistor thing. The CLK was always going to be an "air wire", as I didn't know what sort of freqs were going to be present, and so I didn't want it to go via link after link just to get the layout.
From that point I got a good report from the initial "sanity" tests, and so I went for a 16MB test. This should have separated the banks (on my 64MB chip) but it failed, so I went for 1MB, 2MB, 4..... When I got back to 16MB it failed again (on the random pattern). So it looks as if I have either a sneaky addr problem and/or a banjaxed chip. I mean, just what am I supposed to do with just 8MB !?!
Find the latest version attached:
SdramTest-bst-archive-110410-223324.zip
Thanks.
--Steve
Driver zip file: SdramTest-bst-archive-110413-231745.zip
Here are the speed test results: By comparison, the original driver Steve wrote ran about 10% faster for 32-byte cache lines at 80MHz, but I believe it did this by significantly under-refreshing the SDRAM under heavy load.
SdramTest-bst-archive-110419-202546.zip
PS 0.8mm does stretch the ironing skills.
I have run up the board and it gets through the walking Zeros and the walking Ones tests and then starts on the Incremental Pattern test showing 33554 KB (presumably 32M ?)
A row of "w"s appears slowly and then an "r" is spat out, followed by -
ERROR @ $0001FC00 Expected $0001FF02 Received $0100FF02
Address $0007FC00 buffer $0001D18 523 page.
(And then a listing of Cache dump)
Could this just be that I have the wrong chip 64M8 rather than the 32M8, or has the poor little thing suffered for the removal and resolderings ???
The board is taking 42mA.
TIA
Alan
ADDIT I just noticed that there are problems on the 64KB Random test
From that point I got a good report from the initial "sanity" tests, and so I went for a 16MB test. This should have separated the banks (on my 64MB chip) but it failed, so I went for 1MB, 2MB, 4..... When I got back to 16MB it failed again (on the random pattern). So it looks as if I have either a sneaky addr problem and/or a banjaxed chip. I mean, just what am I supposed to do with just 8MB !?!
Now where did I put that wood plane.
PS Sorry I forgot the focus .
PPS 12MB failed too