PropXMM-D40 Module - Up to 1MB 5.0MB/s random > 5.33MB/s burst? SRAM

Mike Huselton · 2009-06-03 22:08

Yes, I had the counter idea also. I tend to forget the two timers/clocks just itching for something to count.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
JMH

jazzed · 2009-06-04 00:43

Unfortunately, using counters costs more than just incrementing the address by one instruction cycle. It would be fine if one could guarantee an even number of instructions between reads :) Using counters in the 2 COG example works fine because there is write window drift tolerance.

The random read/write numbers for a 1 COG solution appear to be 4.2MB/s and 3.0MB/s (burst rates are not known at this time, but should be higher as described in the post above). I'll post the 1 COG demo code later.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230

jazzed · 2009-06-07 07:07

I now have my old XMM C code working with the PropXMM-D40 hand-wired prototype using 1 COG. Performance is a little better than in the old pin-hungry design ... on one cycle ... aaargh! ... average performance is about the same as before. I plan to make more progress on the demo for UPEW before integrating the 2 COG design.

Attached is a Propalyzer snapshot showing 16 Propeller pins used by the D40 module (plus pin 20 which is a debug trigger only).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve


Post Edited (jazzed) : 6/7/2009 7:31:25 AM GMT

Ron Sutcliffe · 2009-06-07 07:27

Can't wait :)

Ron

jazzed · 2009-06-07 21:30

Well I can use the TV_Text module to print to the screen using C XMM with ICCPROP.

I have some weird issues though. The most disturbing thing is that the pin state settings made by the main LMM COG code are affected by the XMM fetch, because the same COG that runs the LMM also hosts the pin registers ... this is very undesirable to me (this was not an issue with my other board that used separate address and data pins). So to get around this problem, I think it's time to try a different approach for instruction fetch and execute kernel services.

Since I can fetch up to 16 instructions at a time in a block transfer from XMM without re-latching, 16 longs is a good starting point for a cache block implementation using the PropXMM-D40 memory design. Although 16 instructions is a relatively small block, I doubt that it will be a big deal, especially considering the way most Propeller code I've seen is constructed. The extra COG cycles and space can be used to pull in or save other instruction blocks based on address statistics. Guess I have some reading to do now on other approaches to achieve the best-fit solution :)
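To make the 16-long block idea concrete, here is a minimal Python model of a one-line cache. It is only a sketch of the address arithmetic, not kernel code: `split_address`, `OneLineCache`, and the `fetch_block` callback are hypothetical names, and the real thing would have to be PASM inside the kernel.

```python
LINE_LONGS = 16                              # longs per block transfer, from the post
BYTES_PER_LONG = 4
LINE_BYTES = LINE_LONGS * BYTES_PER_LONG     # 64 bytes per cache line


def split_address(byte_addr):
    """Split a long-aligned XMM byte address into (line base, long index)."""
    line_base = byte_addr & ~(LINE_BYTES - 1)
    long_index = (byte_addr & (LINE_BYTES - 1)) // BYTES_PER_LONG
    return line_base, long_index


class OneLineCache:
    """Holds a single 16-long line and counts how often it must be refilled."""

    def __init__(self, fetch_block):
        # fetch_block(line_base) is a hypothetical stand-in for the XMM burst
        # read; it must return the 16 longs starting at line_base.
        self.fetch_block = fetch_block
        self.base = None
        self.line = []
        self.misses = 0

    def read_long(self, byte_addr):
        base, index = split_address(byte_addr)
        if base != self.base:                # miss: refill the whole line
            self.line = self.fetch_block(base)
            self.base = base
            self.misses += 1
        return self.line[index]


if __name__ == "__main__":
    memory = list(range(1024))               # fake XMM contents, one value per long
    cache = OneLineCache(lambda base: memory[base // 4: base // 4 + LINE_LONGS])
    for addr in range(0, 128, 4):            # 32 sequential instruction fetches
        cache.read_long(addr)
    print("line fills for 32 sequential longs:", cache.misses)
```

Straight-line code costs one line fill per 16 fetches in this model; a branch out of the current line costs a fill every time, which is where the address statistics would come in.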

I have a clear idea of the constraints using ICCPROP and will continue that path.

@RossH,
I would like to hear from you on what might be reasonable to do with Catalina since I have no momentum there yet. You had mentioned using a COG as a cache before. Since I really have no choice if I want a good demo, I should also know Catalina's constraints if any exist on potential implementations.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve


Post Edited (jazzed) : 6/7/2009 9:43:40 PM GMT

hinv · 2009-06-07 21:45

So are you saying that with XMM and LMM you can keep 2 cogs busy with more than 512 longs of memory?

jazzed · 2009-06-07 22:14

@hinv, In so many words yes, but that is not what I was "trying" to say. In the simplest terms 1 COG is the LMM interpreter and 1 COG is the code fetch and 2KB (512 long) temporary store or "cache." The LMM interpreter in ICC already has some data-cache ability. The thing we want to do is keep the LMM interpreter COG running as fast as possible.

With an efficient instruction cache design, LMM will run at its normal speed of 6+ MIPS (a rdlong window miss costs about 150ns) whenever no new block is required; the percentage of time that holds is To Be Determined (TBD). On a cache miss, the LMM interpreter will stall, dropping to some (TBD) fetch rate while a "cache line" is being filled. Using some history, the instruction cache can keep the most-used "cache lines" in the Propeller for better performance. We'll see.
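The hit/miss trade-off can be put into a rough formula. The Python sketch below is only a model: `effective_mips` is a hypothetical helper, the 6 MIPS hit speed is the figure quoted above, and the miss rates and the 13 µs line-fill time are placeholders for the TBD values, not measurements.

```python
def effective_mips(hit_mips, miss_rate, fill_us):
    """Average LMM speed when some instructions stall on a cache line fill.

    hit_mips  -- speed while executing from the cache (6+ MIPS per the post)
    miss_rate -- fraction of instructions that trigger a line fill (TBD)
    fill_us   -- stall per line fill, in microseconds (TBD)
    """
    us_per_instruction = 1.0 / hit_mips + miss_rate * fill_us
    return 1.0 / us_per_instruction


if __name__ == "__main__":
    # Placeholder inputs only: one fill per 64 or per 16 instructions,
    # and an assumed 13 us to fill a 16-long line.
    for miss_rate in (0.0, 1 / 64, 1 / 16):
        mips = effective_mips(6.0, miss_rate, 13.0)
        print(f"miss rate {miss_rate:.4f}: {mips:.2f} MIPS")
```

Even a modest miss rate dominates the average in this model, so how often a new block is needed matters more than the peak speed.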

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve


Cluso99 · 2009-06-08 03:13

Nice work Steve

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBladeProp, SixBladeProp, website (Multiple propeller pcbs)
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: Micros eg Altair, and Terminals eg VT100 (Index)
· Search the Propeller forums (via Google)
My cruising website is: www.bluemagic.biz MultiBladeProp is: www.bluemagic.biz/cluso.htm

jazzed · 2009-06-09 04:37

Continuing to pollute the forum :)

It's obvious now that my inter-cog cache hit performance assumptions were terribly wrong: because of the instruction cycles spent on cache hit detection, the result would be worse than just using a 2 COG random read solution. An intra-COG one-line cache will be the next approach to try, but there is little room for that in the kernel.

Barring any progress on the cache front, I'll look at using a separate read COG to overcome the pin interference problem.

Doing cache experiments today has had value, though. I'm seeing a 21% increase in performance with read bursts over random reads with 1 COG (4.84MB/s vs 4MB/s). The 2 COG read burst rate is still unknown, but the gain should be better than the 1 COG performance increase since 4 fewer instructions are used for the actual read.
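For anyone checking the arithmetic, the 21% figure and the time to move one 16-long block follow directly from the two rates. This is a small Python sketch; the helper names are made up, and it assumes 1MB/s means 10^6 bytes per second, which works out to one byte per microsecond.

```python
def speedup_percent(burst_mb_s, random_mb_s):
    """Percentage gain of the burst rate over the random-read rate."""
    return (burst_mb_s - random_mb_s) / random_mb_s * 100.0


def block_time_us(block_bytes, rate_mb_s):
    """Microseconds to move block_bytes at rate_mb_s (1 MB/s = 1 byte/us)."""
    return block_bytes / rate_mb_s


if __name__ == "__main__":
    burst, random_read = 4.84, 4.0           # MB/s, 1 COG figures from the post
    block = 16 * 4                           # one 16-long block = 64 bytes
    print(f"burst gain: {speedup_percent(burst, random_read):.0f}%")
    print(f"64-byte block at burst rate:  {block_time_us(block, burst):.1f} us")
    print(f"64-byte block at random rate: {block_time_us(block, random_read):.1f} us")
```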

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve


jazzed · 2009-06-18 21:08

The PropXMM-D40 prototype Fabs (PC boards) are here. I had discovered some bugs in the design after sending out for Fabs. With some rework, I have verified basic function and will start a new production quality Fab order this week. I will accept pre-order commitments at UPEW or by response to this thread for anyone interested.

"It will be ready before Prop-II" :) To be more specific, new Fabs and assemblies should be available in August.

FirstFab Prototypes are deficient in a few ways. 1. Need WireWrap size pads/holes for interconnect, 2. Need series resistors on data bus, 3. Output Enable on SRAM chip needs to be connected to Address Latch. Otherwise, everything works as expected.

The interconnect requires WireWrap size pin pads/holes so that a right-angle breakaway SIP header can be used. Also, this provides for a 40-pin DIP sandwich for DIP40 stacking.

Added: Shot of one possible stackable solution is shown. Other variations are possible, and I've just discovered a secret that will work well with the upcoming Propalyzer hardware DIP40 module :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve


Post Edited (jazzed) : 6/19/2009 12:54:41 AM GMT

RossH · 2009-06-19 02:37

Hi Jazzed,

Nice work. Are you planning on selling them assembled? If so, put me down for one.

I'm not yet advanced enough in my own XMM support to explore the idea of using a cog as an XMM cache - I have a brilliant idea on how to do this, but I'm not willing to share it yet - just in case it also turns out to be a completely stupid idea.

This week I got bogged down converting my EMM and XMM loaders to load the kernel from the target instead of compiling it into the application program each time. As you pointed out in the Catalina thread, this will make it much easier to support multiple XMM implementations. I won't post it generally just yet because I think you and I are probably the only two people playing with this at the moment - but if you want the update, let me know via PM. BTW did you figure out your I2C problem? If not, you may want to check both the PAGE_SIZE and MAX_KERNEL_LONGS constants. I got caught with the latter being incorrect myself, which gave symptoms similar to the ones you described - I now think MAX_KERNEL_LONGS should just be fixed at $1F0 in all cases.

Interesting that someone started a thread on debugging today. I first thought debugging PASM was hard. Then I thought that was easy but debugging LMM was hard. Now I think LMM is a doddle but debugging XMM is hard. If anyone comes up with another brilliant memory management idea, I think I'll just shoot them.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

jazzed · 2009-06-19 05:44

Hi Ross,

I'll offer a bare PC board with parts-list and schematic, a SIP20 assembled version, and a DIP40 assembled version. Do you have any hardware other than Hydra? The pin-out for this design would not work very well with Hydra .... Of course you could just put a Propeller on top of the PropXMM-D40 module and run the Propeller in RCFAST mode by downloading code to Propeller HUB ram ... but that would be 1/4th normal speed and your loader would have to get code from the serial port.

Thanks for the tip on the Catalina code. I've been up to my ears in hardware all week. Good news is I now have a PropXMM-D40 compatible Propeller board with 128KB DIP8 EEPROM so I can test your XMM build without a loader.

I think an instruction cache would speed things up a little if it can be done in the same cog with the LMM kernel. The biggest burst my SRAM design will do is 16 longs anyway ... it is obvious this would not be a traditional cache design though. The problem I had with cache on another cog was rdlong/wrlong latency ... missing the window is a performance disaster. Having inter-cog messaging really hurts. Even with latching the address, a random read is faster than any inter-cog cache that I came up with.
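A back-of-the-envelope Python model shows why the hub window hurts. Everything here is an assumption for illustration: an 80MHz system clock, the commonly cited Propeller 1 hub instruction cost of 8 clocks when the window is hit and 23 when it is just missed, and a made-up mailbox protocol of four hub operations (requester writes the address, cache cog reads it, cache cog writes the data, requester reads it). Polling loops and tag comparison are ignored, so real numbers would be worse.

```python
CLOCK_MHZ = 80                    # assumed system clock, not stated in the post
CLOCK_NS = 1000.0 / CLOCK_MHZ     # 12.5 ns per clock at 80MHz
HUB_OP_MIN_CLOCKS = 8             # assumed: hub instruction that hits its window
HUB_OP_MAX_CLOCKS = 23            # assumed: hub instruction that just missed it


def round_trip_ns(clocks_per_hub_op, hub_ops=4):
    """Time for a hypothetical mailbox exchange made of hub_ops hub accesses."""
    return clocks_per_hub_op * hub_ops * CLOCK_NS


if __name__ == "__main__":
    # A random read at 4MB/s (the 1 COG figure earlier in the thread)
    # delivers one 4-byte long per microsecond.
    random_read_long_ns = 4 / 4.0 * 1000.0
    print("mailbox best case :", round_trip_ns(HUB_OP_MIN_CLOCKS), "ns")
    print("mailbox worst case:", round_trip_ns(HUB_OP_MAX_CLOCKS), "ns")
    print("random read, 1 long:", random_read_long_ns, "ns")
```

Under these assumptions the worst-case exchange already takes longer than a plain random read of one long, before any cache lookup work is counted.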

I'm familiar with your "bog" :) I have a hex loader working with ICC, but I'm finding that the need to reserve space in hub 0+ for the new code is a little trouble with conventional LMM. I have an FdSerial XMM demo that loads and runs fine and one TV XMM program that partially runs when loaded from conventional LMM. The solution will have to be done in PASM given my limits today or get the loader to load in the upper area of hub memory. A PASM loader might be more inter-compiler portable anyway ... that is likely one of my next steps after I get something running on Catalina.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve


RossH · 2009-06-19 06:03

Hi Jazzed,

I have several boards in addition to my trusty Hydra already, and some more coming. It's finding time to support them all that is becoming the problem. I don't even know the specs of all of them yet, but I'm sure that on one or the other of them I can find a use for one of your modules. Put me down for a DIP40 assembled version.

I don't think there is much hope of a Catalina kernel ever having room to have caching on the same cog - but (as you mention) the problem with having it on a separate cog is the inter-cog communication overhead. My "brilliant" (or possibly "stupid") idea concerns exactly this issue.

And I agree that an inter-operable PASM loader would be the way to go. When we're both a bit further (and also when Bill Henning is a bit further on with Largos - probably after UPEW) let's see if we can come up with something more generic.

Ross.

