Adding 32MB SDRAM to Propeller ...

Nick McClick · 2010-10-27 10:08

@Raven - Yep, these are version A1. I'm hoping we'll have them in quantity in about a month.

@David - Yeah, wouldn't work with a demoboard or C3. But it will work with a protoboard. I think it would also work on a breadboard. It is easiest, though, with a Propeller Platform.

jazzed · 2010-10-27 10:14

Bill, why did you post this? Ravenkallen did not mention C3.
Seems like the misunderstanding is somewhere else.
More coffee?

Bill Henning wrote: »

Ravenkallen, I did not take offense!

I just wanted to clear up your apparent misunderstanding that large memory programs will "finally" be possible on the C3... Cluso and I have had products capable of doing that for a LONG time, Dr_Acula has Dracblade, there is a three-prop design in Germany, Steve is bringing out a 32MB board...

As for graphics, Morpheus has been doing high resolution bitmapped VGA graphics for about 2.5 years now (and has been available for sale since June '09 UPEW)

I am afraid that C3 does not have the bandwidth to do high resolution VGA graphics; however, bitmapped TV graphics from SPI RAM is barely possible on C3, PropCade, and other designs that use SPI RAM.

Bill Henning · 2010-10-27 10:25

More coffee is required...

What I wrote was NOT what I intended to write - not fully awake yet!

I have corrected the post

What I was trying to say is that high-rez graphics needs more than single channel SPI ram, and what I was correcting was large memory "finally" being available...

Thanks for the heads-up - I like the C3 design, I will probably get one for plane rides.

I also like your board - I may get one to experiment with, easier than hand soldering a TSOP-II 54

jazzed wrote: »

Bill, why did you post this? Ravenkallen did not mention C3.
Seems like the misunderstanding is somewhere else.
More coffee?

jazzed · 2010-10-27 11:02

Bill Henning wrote: »

Steve's board should allow for 4 pin high res VGA :-)

The VGA section on the SDRAM board is designed to use 8 Propeller pins, that is RGB+VH Sync. The 8 pin interface sacrifices being able to use the SDCARD of course. One could hack the VGA to create a 4 pin interface that would allow using SDCARD, but I honestly don't see much value in monochrome VGA over color TV.

The TV variation of the SDRAM board is the normal TV Video DAC + Audio channel and allows using the SDCARD.

Regarding graphics, there is no proof yet that the SDRAM board can be used for richly textured video. The SDRAM can be accessed at 5.3MBytes/s or 42.4Mbit/s so that is certainly enough bandwidth.

The richly textured video project is certainly on my list, but I have other things to do before that.

BTW, SDRAM is much cheaper than SRAM for the same amount of memory.

Bill Henning · 2010-10-27 11:29

Obviously 8 pin VGA is better

5.3MB/sec should allow for 640x480x60hz x 4 color, but with slow screen updates as 640x480x60x4col = 4.6MB/sec

which is still pretty good!

For TV, 320x192x60x136 colors = 3.69MB/sec, with fairly slow update

BUT

160x192x136 colors is only 1.84MB/sec, which would allow for good screen update rates!
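The arithmetic behind these figures can be checked with a short sketch. It assumes 2 bits per pixel for 4 colors and one byte per pixel for 136 colors; the function name and those pixel depths are illustrative assumptions, not something stated in the thread.

```python
def video_bytes_per_second(width, height, fps, bits_per_pixel):
    """Raw framebuffer bandwidth needed to redraw every pixel of every frame."""
    return width * height * fps * bits_per_pixel // 8

# Assumed pixel depths: 4 colors -> 2 bits, 136 colors -> 8 bits (one byte).
vga_4_color = video_bytes_per_second(640, 480, 60, 2)
tv_320 = video_bytes_per_second(320, 192, 60, 8)
tv_160 = video_bytes_per_second(160, 192, 60, 8)

for name, rate in [("640x480 VGA, 4 colors", vga_4_color),
                   ("320x192 TV, 136 colors", tv_320),
                   ("160x192 TV, 136 colors", tv_160)]:
    print(f"{name}: {rate / 1e6:.2f} MB/s")
```

Whatever is left of the 5.3MB/sec after the display fetch is what remains for screen updates, which is why the 160x192 mode leaves the most headroom.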

And yeah, SDRAM is MUCH cheaper than SRAM - I think your SDRAM 32MB chip is roughly the same cost (within factor of 2) as the 0.5MB SRAM's.

jazzed wrote: »

The VGA section on the SDRAM board is designed to use 8 Propeller pins, that is RGB+VH Sync. The 8 pin interface sacrifices being able to use the SDCARD of course. One could hack the VGA to create a 4 pin interface that would allow using SDCARD, but I honestly don't see much value in monochrome VGA over color TV.

The TV variation of the SDRAM board is the normal TV Video DAC + Audio channel and allows using the SDCARD.

Regarding graphics, there is no proof yet that the SDRAM board can be used for richly textured video. The SDRAM can be accessed at 5.3MBytes/s or 42.4Mbit/s so that is certainly enough bandwidth.

The richly textured video project is certainly on my list, but I have other things to do before that.

BTW, SDRAM is much cheaper than SRAM for the same amount of memory.

jazzed · 2010-11-04 21:50

Here's an SdramCache.spin that allows using cache more effectively.
The dirty bit logic works now and I've added a LINELEN_256 option.
One of my latest optimizations fails, this copy should work.

jazzed · 2010-11-05 11:35

Here is a Keyboard demo that uses a variation of Baggers' TV_Text_Half_Height.spin for video. The demo handles many of the keyboard control keys such as arrows for maneuvering around the screen. This demo is designed to work with the Keyboard in the Mouse port for the atTiny85 based board labeled "GG_SDRAM REV A1".

This is also usable with other TWI I2C compliant modules with a device number change. Normally the PS2 Mouse connector is selected as ID0/1 and Keyboard selected as ID2/3. In the attached version, the Keyboard is defined to use the Mouse connector ID0/1.

--Steve

Oldbitcollector (Jeff) · 2010-11-09 12:19

I've been quietly lurking this thread for a while...

What has become of the board that Jazzed/Nick are working on? Updates?

OBC

jazzed · 2010-11-09 13:46

@OBC, we're almost ready for production orders.

Our "pilot builds" start this week. We should have our first manufacturing runs a few days after pilot.

Look for a few new threads announcing availability and product details including software and demo videos.

Back to demo building ....

Oldbitcollector wrote: »

I've been quietly lurking this thread for a while...

What has become of the board that Jazzed/Nick are working on? Updates?

OBC

hinv · 2010-11-09 19:41

Any chance you will do a tiny matchbox sized board like Cluso's?
With the C3 free pins and connector, it could make a nice add on.

just my 10 bits worth,

Doug

jazzed · 2010-11-09 20:16

Hi Doug,

A matchbox size could be done, but I'm not sure about what I/O configs are possible.

I've planned to do a single board computer all along, but that will be later after other developments. For example, I have a solution running on my desk that provides VGA+SDRAM+USB/PS2+SDCARD+4 totally free I/O for the mother board, but it requires more software and it has a Propeller on it. I'll also be bringing out a Propalyzer board for the "Propeller Platform" footprint. Then there are a few secret things I've found to do.

My priorities right now in order are:

get current boards order-able
get a ZOG demo running and published
get a Catalina demo running and published

hinv wrote: »

Any chance you will do a tiny matchbox sized board like Cluso's?
With the C3 free pins and connector, it could make a nice add on.

just my 10 bits worth,

Doug

Oldbitcollector (Jeff) · 2010-11-10 20:55

I'm shocked that someone didn't bump this thread today with the announcement of the 32MB SDRAM module from Gadget Gangster posted today. (Perhaps Nick and Jazzed didn't want to toot their own horn, so I'm going to do it for them.)

Nice job guys! It looks like an inexpensive and easy answer to a powerful Propeller setup.

OBC

Nick McClick · 2010-11-10 21:07

Shhh... It's a secret! I'll put up a thread on it with details shortly.

jazzed · 2010-11-10 21:54

@OBC, Thanks for the toot! Your web site looks great!

--Steve

Heater. · 2010-11-10 23:12

I wonder how Dr_Jim feels about this.

Great work guys!

hinv · 2010-11-11 07:09

I was just thinking the same thing.... I don't know if this twinges your conscience or not Steve, but since this is being offered, you are diverting funds away from MIT, and consequently, we may never get artificial intelligence....

Bill Henning · 2010-11-11 08:03

Nice work guys!

Nick McClick wrote: »

Shhh... It's a secret! I'll put up a thread on it with details shortly.

Heater. · 2010-11-11 08:04

hinv,

...consequently, we may never get artificial intelligence....

That's OK we have ZOG. He's smart enough for now:)

Toby Seckshund · 2010-11-11 09:13

...consequently, we may never get artificial intelligence....

According to my CV it's already here.

hinv · 2011-01-27 23:00

Steve,

Maybe instead of a matchbox computer, how about a proppad type computer with an lcd, 32MB of memory, microsd in a really portable form factor.

Just putting in my 10 bits.

Doug

jazzed · 2011-01-28 10:19

hinv wrote: »

Maybe instead of a matchbox computer, how about a proppad type computer with an lcd, 32MB of memory, microsd in a really portable form factor.

Later I could make an LCD version of MicroProp PC (uPropPC) ... uPropPad ???
How big do you want the LCD?

Ding-Batty · 2011-04-02 12:47

I have been working with the GG SDRAM board, and Jazzed's drivers. They are not quite what I need, but close -- I will eventually want a 14-pin SDRAM interface (two address latches instead of one), without caching to maximize raw sustained throughput.

So here is my first "checkpoint" of changes to the SDRAM driver:

By reorganizing the initialization code, and moving it into the cog memory for the tag vector, I was able to increase the number of cache lines from 64 to 128.
Reworked the various definitions, to "clean up" some confusion (perhaps only my own) about byte vs. long sizes of things. Also, now the code determines the "best" number of refreshes to do between I/O block transfers at compile time.
Reworked the sdram_refresh routine to better conform to the specs for the SDRAM chip, and to make it easier to change the number of refreshes in one burst. Also, reworked the places at which the command dispatch code checks for refresh (and added a few).

Note that this does not (yet) improve the speed of the driver -- in fact, due to the changes to the refresh logic, I believe this one is a little slower, but it better matches the documented refresh requirements of the chip. Also, this driver needs to have the operating CLKFREQ hard-coded into the constant definitions so that it can calculate the "optimal" refresh burst length and number of command polling loops between refresh bursts at compile time.
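The compile-time calculation described above can be sketched roughly as follows. The figure of 8192 refresh cycles every 64 ms is typical for 256 Mbit SDRAM parts but is an assumption here and should be checked against the actual chip's datasheet. The 16-clock polling loop (one hub access window) and all function names are likewise illustrative, not taken from the driver.

```python
# Assumed datasheet figures (typical for 256 Mbit SDRAM, not confirmed for this board).
REFRESH_CYCLES = 8192        # auto-refresh commands required per period
REFRESH_PERIOD_MS = 64       # length of that period in milliseconds
POLL_LOOP_CLOCKS = 16        # assumed cost of one command-poll loop (one hub window)

def clocks_per_refresh(clkfreq_hz):
    """System clocks available per single refresh when they are spread evenly."""
    return clkfreq_hz * REFRESH_PERIOD_MS // (1000 * REFRESH_CYCLES)

def burst_interval_clocks(clkfreq_hz, burst_len):
    """Clocks between the starts of two bursts of burst_len refreshes each."""
    return burst_len * clocks_per_refresh(clkfreq_hz)

def polls_between_bursts(clkfreq_hz, burst_len):
    """How many poll loops fit between bursts, ignoring the burst's own duration."""
    return burst_interval_clocks(clkfreq_hz, burst_len) // POLL_LOOP_CLOCKS

print(clocks_per_refresh(80_000_000))        # clocks per refresh at 80 MHz
print(polls_between_bursts(80_000_000, 8))   # poll loops between bursts of 8
```

A longer burst leaves a longer uninterrupted gap for block transfers, at the cost of a longer stall whenever the burst itself runs.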

I also fixed a small bug in the test program (it was clobbering address 0 in the hub -- the current clock frequency storage location).

I have further plans for more changes and improvements, that I am already in the middle of:

Revise the block read/write routines to use both counters -- this will permit the driver to use block sizes other than 32 bytes as is fixed in the current implementation. Using blocks larger than 32 bytes permits greater sustained throughput (less overhead per byte for the block transfers and refreshes).
Revise the initialization code to calculate the refresh burst length and command poll loop count at runtime from the clock frequency in long[0], so the driver better adapts to different boards with less hand configuration at compile time.
Produce a non-caching version of the driver, written for maximum throughput speed, for bulk I/O applications such as high speed data collection and video generation.
Produce a two-address-latch version, which will then only use 14 Propeller I/O pins -- this should only decrease the maximum speed by at most 3% (estimated), which seems very desirable for some applications. More free I/O pins is a Good Thing.

Lastly, a few estimates about throughput. This version of the driver, using 32 byte cache blocks, should be able to read up to about 3 MB/s sustained, using an 80MHz system clock. Note that write speeds are half of that, because every write is accompanied by a read (cache line flush and replacement), even if that data will be completely overwritten. But my current estimates for larger cache line sizes show that going to a 64-byte cache line gives a sustained speed of 4.3 MB/s, a 128 byte cache line gives 5.2 MB/s, and a 256-byte cache line gives a 5.8 MB/s sustained read speed, all for an 80MHz clock. In fact, at a 100 MHz system clock, using a 512-byte cache line gives a speed of 7.7 MB/s, and a 1024-byte block gives just about 8 MB/s.

All those figures are speeds from SDRAM to HUB memory buffers.
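To illustrate why a write that misses the cache costs both a block write and a block read, here is a minimal direct-mapped, write-back cache model. It is only a sketch: the class, its fields, and the use of a bytearray as stand-in SDRAM are assumptions for illustration and do not reflect the actual layout of the driver. Only the 32-byte line and 128-line figures come from the post.

```python
LINE_BYTES = 32      # cache line length from the post
NUM_LINES = 128      # number of cache lines (tags) from the post

class DirectMappedCache:
    """Toy write-back cache; counts block transfers to the backing store."""

    def __init__(self, backing):
        self.backing = backing                      # bytearray standing in for SDRAM
        self.tags = [None] * NUM_LINES              # which block each line holds
        self.dirty = [False] * NUM_LINES            # line modified since it was filled?
        self.lines = [bytearray(LINE_BYTES) for _ in range(NUM_LINES)]
        self.block_reads = 0
        self.block_writes = 0

    def _line_for(self, addr):
        block = addr // LINE_BYTES
        index = block % NUM_LINES
        if self.tags[index] != block:               # cache miss
            if self.dirty[index]:                   # flush the old contents first
                old = self.tags[index] * LINE_BYTES
                self.backing[old:old + LINE_BYTES] = self.lines[index]
                self.block_writes += 1
            start = block * LINE_BYTES              # then fill with the new block
            self.lines[index][:] = self.backing[start:start + LINE_BYTES]
            self.block_reads += 1
            self.tags[index] = block
            self.dirty[index] = False
        return index

    def read(self, addr):
        index = self._line_for(addr)
        return self.lines[index][addr % LINE_BYTES]

    def write(self, addr, value):
        index = self._line_for(addr)                # a miss still reads the line in
        self.lines[index][addr % LINE_BYTES] = value & 0xFF
        self.dirty[index] = True
```

In this model, two addresses that are `LINE_BYTES * NUM_LINES` apart map to the same line, so alternating writes to them force a flush and a fill on every access.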

Current version: SdramTest-bst-archive-110401-000817.zip

jazzed · 2011-04-02 17:29

Ding-Batty wrote: »

I have been working with the GG SDRAM board, and Jazzed's drivers. They are not quite what I need, but close -- I will eventually want a 14-pin SDRAM interface (two address latches instead of one), without caching to maximize raw sustained throughput.

So here is my first "checkpoint" of changes to the SDRAM driver:

@Ding-Batty,

Your enhancements look great. You've put a lot of good work into the driver. Thanks a bunch for publishing this.

There is only one place where I have a question. At the beginning of the cache command loop, there are extra refresh instructions commented at "Initial refresh check:" which cause another HUB window miss. It looks like the instructions could be folded into sdram_refresh some way.

The sdram cmdloop needs to be as short as possible for determining a cache hit; maybe you can find a way to move the instructions?

sdramDone
cmdloop             djnz    refresh,#norefresh  ' check refresh every time ... djnz here keeps window
                    call    #sdram_refresh      ' if refresh timer expired, reload and refresh

norefresh           rdlong  addr,cmdptr    wz   ' get command/address ... command is in bit 0..1
    if_z            jmp     #cmdloop            ' if zero, do nothing

                    ' Initial refresh check:
                    ' If running through the cache hit/miss check pushes us
                    ' past the time we should perform a refresh, we do one
                    ' early right here.
                    ' This accounts for the time until we get through this
                    ' routine with a cache hit: this is about 23 instructions,
                    ' or about 6 loops.
                    ' sub   refresh, # initial_check_overhead wc, wz
                    sub   refresh, #6   wc, wz
    if_c_or_z       mov   refresh, #0
    if_c_or_z       call  #sdram_refresh

                    mov     clineptr,addr       ' get the cache line pointer offset from address
.....

I look forward to seeing details of your 14 pin hardware interface if you're willing to share that.

Thanks again.
--Steve

Ding-Batty · 2011-04-03 11:16

jazzed wrote: »

I look forward to seeing details of your 14 pin hardware interface if you're willing to share that.

Nothing special about the design -- it has two 74HC573's (or LVT) instead of one. The first is for A0..7, the second is for A8..12 and BA0, BA1. The thing to note is that the only time the high address bits have to be latched is while data is being actively read or written. The rest of the time that latch can be transparent, so access to the bank bits and A10 can be done directly through the P0..P7.

There is very little that needs to change in the driver: the send_Address routine, called at the start of both the block read and write routines, needs a few more instructions to prepare the high address bits for writing, and to latch the high address bits, but since the data address auto-increments during the transfer, that is the only cost (and other trivial changes).

I actually had a version of the hardware wired up on a protoboard, with a version of the original driver along with the small modifications needed, and it worked fine, except that I was getting "phantom" SDRAM clock pulses when writing a $00 after a $FF in a burst write.

So, I'm going to have to climb the DipTrace/Eagle learning curve to lay out a board with the SDRAM chip and two latches (and bulk and bypass caps, and pull-ups, etc.) to go further with that. But then I get to change some of the pin assignments, which should save some memory, and possibly give a little speed-up as well.

The pin assignments I expect to use:

P0..P7 connect to SDRAM D0..D7, and the data input of the two latches
P20 is ALE_LO, for the A0..A7 latch
P21 is ALE_HI, for the A8..A12,BA0,BA1 latch
P22 is SDRAM CLK
P25 is RAS*
P26 is CAS*
P27 is WE*

This pin assignment has the following properties:

Data on P0..P7 is the fastest
RAS*, CAS* and WE* on P25..P27, with no other control pins "nearby"; allows us to use the movi instruction to set them for SDRAM commands directly, without needing cog storage for the command values as in the current driver
P8..P15 are free for VGA Video
P8..P19 are free for three different TV pin groups
P9..P17 are free for any particular application that might be able to use the movd instruction for controlling specific hardware more efficiently, as the SDRAM driver uses "movs outa,data".
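As a reminder of which pins each field-move instruction can reach when its destination is outa, the sketch below models the Propeller's movs, movd and movi as operations on a 32-bit value. They replace bits 8..0, 17..9 and 31..23 respectively. The Python helpers are illustrative only.

```python
MASK32 = 0xFFFFFFFF
FIELD = 0x1FF                                  # each instruction writes a 9-bit field

def movs(dest, value):
    """Replace bits 8..0 (source field) of dest."""
    return (dest & ~FIELD & MASK32) | (value & FIELD)

def movd(dest, value):
    """Replace bits 17..9 (destination field) of dest."""
    return (dest & ~(FIELD << 9) & MASK32) | ((value & FIELD) << 9)

def movi(dest, value):
    """Replace bits 31..23 (instruction field) of dest."""
    return (dest & ~(FIELD << 23) & MASK32) | ((value & FIELD) << 23)

def pins(mask):
    """List the pin numbers whose bits are set in a 32-bit mask."""
    return [p for p in range(32) if (mask >> p) & 1]

print("movs reaches", pins(movs(0, FIELD)))    # P0..P8
print("movd reaches", pins(movd(0, FIELD)))    # P9..P17
print("movi reaches", pins(movi(0, FIELD)))    # P23..P31
```

With this layout the data bus on P0..P7 sits in the movs field, P9..P17 sit in the movd field, and RAS*, CAS* and WE* on P25..P27 sit in the movi field. A cog only drives the pins enabled in its own dira, so the other bits in a field do not reach pins that cog has left as inputs.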

More on the refresh issues in following posts.

jazzed · 2011-04-03 11:32

Your pin assignments look good. I'll post a modified PropellerPlatform SDRAM schematic later to save you some time.

Oldbitcollector (Jeff) · 2011-04-03 11:40

14 pins instead of using most of them... Now we're talking...

OBC

Ding-Batty · 2011-04-03 11:40

jazzed wrote: »

Your pin assignments look good. I'll post a modified PropellerPlatform SDRAM schematic later to save you some time.

Thanks! I take it you use Eagle, from your comments in other threads?

And I'm in the middle of a long-ish discussion about the command loop and refreshes...

Ding-Batty · 2011-04-03 12:35

jazzed wrote: »

@Ding-Batty,

Your enhancements look great. You've put a lot of good work into the driver. Thanks a bunch for publishing this.

There is only one place where I have a question. At the beginning of the cache command loop, there are extra refresh instructions commented at "Initial refresh check:" which cause another HUB window miss. It looks like the instructions could be folded into sdram_refresh some way.

The sdram cmdloop needs to be as short as possible for determining a cache hit; maybe you can find a way to move the instructions?

Thanks -- I do love a challenge. Or rather... I do get quite obsessed over a good challenge

Of course, you are right about my slowing down the cache hit command loop -- I was focusing on getting the refresh bursts scheduled more "to spec", and not really thinking about the command efficiency in detail. But I have gone back and reviewed the original code in the command loop.

First, a short digression (I do tend to be long-winded -- you have been warned!):

My approach to refresh scheduling was to assume that worst-case we need to follow the datasheet recommendations for refresh, assuming that in some possibly rare cases, delaying refreshes may lead to data loss. I seem to recall reading some posting, perhaps here, that observed that DRAM is fairly forgiving in refresh, and even delaying refreshes by a factor of two did not cause any problems. But I'm not sure how much I'd trust that for any kind of general-use product.

So, my design principles for refresh:

Refreshes are done in (small) bursts, to give more time between the bursts for blocks of I/O
Refreshes are distributed evenly
If performing an I/O operation would delay a refresh past its schedule, it is better to do the refresh early than late.

So, using this approach, I added refresh tests at the start of code sequences that might delay a refresh: before a read block, before a write block, and before a cache hit check. Now, as you pointed out, the cache hit code is actually fairly time-sensitive, since that should be the most frequent use of a caching driver, especially for emulations (such as ZOG) or extended memory programs, such as generated by Catalina. I admit that my interest is in raw block transfer speeds, so I started a little biased.

So I took a closer look at the original code, and saw how the rdlong and the following wrlong instructions were properly timed for hitting the HUB access time slots on a cache hit, which I missed before, and I see how that would make a big difference in the performance of a caching driver.

But I also noticed that the full command cycle timing from one "rdlong addr,cmdptr" to the next actually misses the "best" hub access slot by 4 clocks (one instruction) in the original code. In the original code this could have been fixed as follows:

    if_e            wrlong  zero,cmdptr         ' if match let user know we're done early ...
                                                ' we rely on the fact that the user is looking for 0 cmdptr
                                                ' and must use another HUB access to get data.
                                                ' problems may come if the user's cog is lower than this cog number

                    and     clineptr,_CLP_MOD   ' user sends full address. we only load blocks
                    add     clineptr,cacheptr   ' get cache line
                    wrlong  clineptr,datptr     ' send cache buffer to user - data may change before we're done
    if_e            jmp     #writeTag           ' added this to preserve hub access timing

                    call    #flush              ' if bad tag, flush - cache code never changes Z flag
                    wrlong  zero,cmdptr         ' let user know we're ready after flush ...

writeTag            movd    tagit,readtag

But I wanted to add at least two more instructions to this command loop, to check whether a cache hit would delay a refresh. Originally I added them at the start of the loop, but after your comment, I saw that I could instead add them at the end of the loop. This would delay that particular refresh by up to 80 system clocks (1us at 80MHz), but done right would not delay any later refreshes.

So I reworked the command code -- I unrolled the loop to eliminate one of the jumps, and replaced the djnz logic with different refresh check logic specifically for right after a cache hit.

This seems to accomplish all the desired results:
The cache hit command loop is now 16 clocks shorter than the original logic, at the expense of making the cache miss logic a little longer.
The refresh logic detects that a refresh burst was delayed by a cache hit, and does the refresh immediately.

Unfortunately, I think the code is now somewhat ugly, and any suggestions for improvements would be welcome. Code archive is attached:
SdramTest-bst-archive-110403-132008.zip

jazzed · 2011-04-03 14:05

Good work

It's hard to beat detecting a cache hit in 12 clock cycles.
I'll do some testing with your driver later.

A 14pin schematic.pdf is attached. I've almost finished routing a board.

Ding-Batty · 2011-04-07 21:37

No code to post yet, but I just got the new write loop working: it can handle any cache line length that is a multiple of 4 bytes, with the loop taking just 12 system clocks per byte written. And it frees up about 56 longs in the cog.
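For a sense of scale, 12 system clocks per byte puts an upper bound on the burst write rate. The sketch below works that out and optionally amortizes a per-block overhead; the overhead parameter and the function name are hypothetical, and no overhead figure is given in the thread.

```python
def write_rate_bytes_per_s(clkfreq_hz, block_bytes, clocks_per_byte=12, overhead_clocks=0):
    """Sustained rate for one block: bytes moved divided by the time the block takes.

    overhead_clocks is a hypothetical per-block cost (address setup, refresh, ...).
    """
    total_clocks = block_bytes * clocks_per_byte + overhead_clocks
    return clkfreq_hz * block_bytes / total_clocks

# Upper bound from the inner loop alone, with no per-block overhead.
print(f"{write_rate_bytes_per_s(80_000_000, 32) / 1e6:.2f} MB/s at 80 MHz")
print(f"{write_rate_bytes_per_s(100_000_000, 32) / 1e6:.2f} MB/s at 100 MHz")
```

With zero overhead the block size cancels out; any nonzero overhead makes larger blocks more efficient, which is why longer cache lines give higher sustained throughput.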

Once I have the block read code rewritten as well (it will be a little harder, as there is one more instruction to fit into the loop, and handling the CAS latency makes stopping a little harder) I will post a new version of the code.

Right now the number of cache line tags is limited to 128 in the cog. I don't think that I can save enough on the block read code to find 128 more longs to increase the number to 256 -- it would be nice, but I don't think I'll be able to find that many instructions or variables to eliminate.

And thanks for the schematic -- my work on that portion of the project is going to have to wait about a month, due to taxes, two concerts, one show for my wife, and a few other important matters.
